Multimodal Memory

EverOS turns non-text content (images, PDFs, audio, office documents, HTML, email) into the same structured, searchable memory as plain text. Attach the asset to a message at ingest time; a vision/audio-capable LLM parses it into text, which then flows through the same extraction → markdown → index pipeline as any text turn. The result is retrievable through the same /search stack.

How It Works

POST /api/v1/memory/add
  messages[].content = [ ContentItem, ContentItem, ... ]
        │
        ├─ text items      → used verbatim
        └─ non-text items  → multimodal LLM (everalgo-parser)
                ↓
  parsed text merged back into the session buffer (in original order)
        ↓
  boundary detector → extraction LLM → MemCell
        ↓
  markdown (truth)  +  SQLite (state)  +  LanceDB (vector + BM25)
        ↓
  retrievable via /search and /get like any text memory

Each non-text ContentItem is routed through the parser, which calls a separate, vision/audio-capable LLM configured independently from the main extraction [llm] — so parsing can target a multimodal endpoint without changing boundary or extraction behaviour. Visual/audio formats (image / pdf / audio / office) always go through that LLM; a few text-bearing formats (e.g. a plain email with no inline images) can be parsed without it. The parser returns text that takes the place of the asset in the message buffer — nothing downstream knows or cares that the content originated as an image or PDF. The raw bytes are not persisted past extraction — only the parsed text is stored.
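The substitution step can be pictured as a small loop over the content items. This is a minimal sketch under stated assumptions, not the EverOS implementation: `flatten_content`, `parse_asset`, and `fake_parser` are hypothetical names, and joining the parts with newlines is chosen only for illustration.

```python
from typing import Callable, Dict, List

ContentItem = Dict[str, str]  # e.g. {"type": "image", "uri": "https://..."}


def flatten_content(
    items: List[ContentItem],
    parse_asset: Callable[[ContentItem], str],
) -> str:
    """Replace each non-text item with parsed text, keeping the original order."""
    parts: List[str] = []
    for item in items:
        if item["type"] == "text":
            parts.append(item["text"])        # text items are used verbatim
        else:
            parts.append(parse_asset(item))   # hypothetical multimodal parse
    return "\n".join(parts)


def fake_parser(item: ContentItem) -> str:
    # Stand-in for the multimodal LLM call; returns a fixed placeholder.
    return f"[parsed {item['type']}]"


merged = flatten_content(
    [
        {"type": "text", "text": "Here's the whiteboard."},
        {"type": "image", "uri": "https://example.com/whiteboard.png"},
    ],
    fake_parser,
)
print(merged)
```

Because the asset is replaced by text at its original position, the later stages only ever see a text buffer.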

Prerequisites

Install the multimodal extra

Multimodal parsing lives behind an optional dependency group so the base install stays lean:

pip install 'everos[multimodal]'
# or with uv:
uv pip install 'everos[multimodal]'

This pulls in everalgo-parser[svg] — the [svg] bundle adds cairosvg so SVG works out of the box.

Install LibreOffice (office documents only)

Office formats (.doc / .docx / .ppt / .pptx / .xls / .xlsx) are converted to PDF before being fed to the multimodal LLM. LibreOffice must be present on the server host:

brew install --cask libreoffice          # macOS
sudo apt-get install -y libreoffice      # Debian / Ubuntu

Without LibreOffice, office uploads return 503 (CAPABILITY_UNAVAILABLE). Image, PDF, audio, HTML, and email parsing are unaffected.
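A quick preflight check on the server host can catch a missing install before the first office upload. The sketch below only looks for the usual LibreOffice launcher names on `PATH`; how EverOS itself locates LibreOffice is not described here, so treat `find_libreoffice` as a hypothetical helper. On macOS the cask installs an app bundle whose `soffice` binary may not be on `PATH`, in which case this check can report a false negative.

```python
import shutil


def find_libreoffice(candidates=("soffice", "libreoffice")):
    """Return the path of the first LibreOffice launcher found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


found = find_libreoffice()
print(found or "LibreOffice not found on PATH; office uploads may be rejected")
```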

Configure the multimodal LLM

The parser uses its own LLM section, independent from [llm]. The model must accept OpenAI image_url parts. everos init writes these into the generated .env:

EVEROS_MULTIMODAL__MODEL=google/gemini-3-flash-preview
EVEROS_MULTIMODAL__API_KEY=<your key>
EVEROS_MULTIMODAL__BASE_URL=https://openrouter.ai/api/v1

The default targets Gemini via OpenRouter so a single key covers both chat extraction and multimodal parsing.
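To check whether a candidate model accepts `image_url` parts, it can help to see what such a request looks like. The sketch below builds an OpenAI-style chat payload with one image part; `image_probe_payload` is a hypothetical helper, the prompt text is arbitrary, and the bytes are only a PNG signature used as a placeholder (a real probe needs a real image).

```python
import base64
import json


def image_probe_payload(model: str, png_bytes: bytes) -> dict:
    """Build an OpenAI-style chat request containing one image_url part."""
    data_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
    }


# Placeholder bytes: the 8-byte PNG signature, not a decodable image.
payload = image_probe_payload("google/gemini-3-flash-preview", b"\x89PNG\r\n\x1a\n")
print(json.dumps(payload, indent=2))
```

On an OpenAI-compatible endpoint this body would be POSTed to the `/chat/completions` path under the configured base URL, with the API key as a bearer token. The `data:` URL here belongs to the OpenAI part format only; the EverOS `base64` field takes plain base64 without that prefix.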

Supported Content Types

`type`	Typical formats	Payload	Notes
`text`	—	`text`	Plain text; a bare string `content` also maps here
`image`	PNG / JPG / GIF / WebP / SVG	`uri` or `base64`	SVG via bundled `cairosvg`
`pdf`	PDF	`uri` or `base64`	—
`audio`	MP3 / WAV / …	`uri` or `base64`	Endpoint must accept audio parts
`doc`	DOC / DOCX / PPT / PPTX / XLS / XLSX	`uri` or `base64`	Requires LibreOffice
`html`	HTML	`uri` or `base64`	To send HTML as plain text instead, use `type: "text"`
`email`	EML / MSG	`uri` or `base64`	—

A non-text item must carry either uri or base64 — a non-text item with only a text field returns 415.
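A client can apply the same rule before sending, which avoids a round trip that ends in 415. This is a sketch of a client-side pre-check, not the server's validator: `check_item` and its error strings are hypothetical, and the check on text items is an assumption added for completeness.

```python
from typing import Optional

NON_TEXT_TYPES = {"image", "pdf", "audio", "doc", "html", "email"}


def check_item(item: dict) -> Optional[str]:
    """Return a reason the item would likely be rejected, or None if it looks fine."""
    kind = item.get("type")
    if kind == "text":
        # Assumption: a text item is expected to carry a "text" field.
        return None if "text" in item else "text item needs a 'text' field"
    if kind not in NON_TEXT_TYPES:
        return f"unknown type: {kind!r}"
    if not (item.get("uri") or item.get("base64")):
        return f"{kind} item needs 'uri' or 'base64'"
    return None


print(check_item({"type": "image", "text": "a whiteboard"}))
print(check_item({"type": "pdf", "uri": "file:///srv/uploads/q3.pdf"}))
```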

Sending Multimodal Content

Switch the content field from a plain string to an array of typed ContentItem objects:

Field	Purpose
`type`	One of the modalities above
`text`	Literal text — only for `type: "text"`
`uri`	`http(s)://` (fetched server-side) or `file://` (read from server filesystem)
`base64`	Inline payload, plain base64 (no `data:` prefix)
`ext`	Extension hint (`"pdf"`, `"png"`, …); effectively required for `base64` payloads
`name`	Display filename for logs

Aspect	`uri` (`http(s)://`)	`base64`
Where the bytes live	Fetched transiently at parse time	Held verbatim in the SQLite session buffer until flush
Wire size	URL only	~4/3× the raw size (base64 inflation)
Best for	Large assets, S3/OSS presigned URLs	Small assets, or when no reachable URL exists

Prefer uri for anything large. A multi-MB base64 blob becomes multi-MB of SQLite buffer text for the buffer’s lifetime and slows request parsing. The bytes are never persisted past extraction either way — only the parsed text is.
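For small local files, a base64 item can be assembled in a few lines. The sketch below is illustrative: `base64_item` and `encoded_size` are hypothetical helpers, and deriving `ext` from the filename suffix is an assumption about what makes a useful hint.

```python
import base64
from pathlib import Path


def base64_item(path: str, kind: str) -> dict:
    """Build a ContentItem that carries a local file inline as plain base64."""
    p = Path(path)
    raw = p.read_bytes()
    return {
        "type": kind,
        "base64": base64.b64encode(raw).decode("ascii"),  # plain base64, no "data:" prefix
        "ext": p.suffix.lstrip(".").lower(),              # extension hint for dispatch
        "name": p.name,                                   # display filename for logs
    }


def encoded_size(raw_len: int) -> int:
    """Length of padded base64 output for raw_len input bytes."""
    return 4 * ((raw_len + 2) // 3)


# A 3 MB asset grows to 4 MB on the wire and in the session buffer.
print(encoded_size(3_000_000))
```

The size helper makes the 4/3 inflation concrete, which is a quick way to decide when a `uri` is the better choice.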

An image by URL, sent with curl (the timestamp is in epoch milliseconds):

TS=$(($(date +%s) * 1000))
curl -X POST http://127.0.0.1:8000/api/v1/memory/add \
  -H 'Content-Type: application/json' \
  -d "{
    \"session_id\": \"mm-001\",
    \"messages\": [
      {
        \"sender_id\": \"alice\",
        \"role\": \"user\",
        \"timestamp\": $TS,
        \"content\": [
          { \"type\": \"image\", \"uri\": \"https://example.com/whiteboard.png\" }
        ]
      }
    ]
  }"

Text plus an image by URL (request body):

{
  "session_id": "mm-001",
  "messages": [
    {
      "sender_id": "alice",
      "role": "user",
      "timestamp": 1748390400000,
      "content": [
        { "type": "text",  "text": "Here's the whiteboard from today's planning session." },
        { "type": "image", "uri": "https://example.com/whiteboard.png", "name": "whiteboard.png" }
      ]
    }
  ]
}

Text plus an inline base64 PDF (request body):

{
  "session_id": "mm-001",
  "messages": [
    {
      "sender_id": "alice",
      "role": "user",
      "timestamp": 1748390400000,
      "content": [
        { "type": "text", "text": "Quarterly report attached." },
        { "type": "pdf",  "base64": "JVBERi0xLjQK...", "ext": "pdf", "name": "q3.pdf" }
      ]
    }
  ]
}

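The text-plus-image request from Python, using httpx:

import httpx

# Send one user message: a text item followed by an image fetched by URL.
response = httpx.post(
    "http://127.0.0.1:8000/api/v1/memory/add",
    json={
        "session_id": "mm-001",
        "messages": [
            {
                "sender_id": "alice",
                "role": "user",
                "timestamp": 1748390400000,  # epoch milliseconds
                "content": [
                    {"type": "text", "text": "Here's the whiteboard from today's meeting."},
                    {"type": "image", "uri": "https://example.com/whiteboard.png"},
                ],
            }
        ],
    },
    # httpx times out after 5 s by default; the longer value is an assumption
    # that leaves room for the multimodal parse.
    timeout=60.0,
)
response.raise_for_status()  # raises httpx.HTTPStatusError on 415 / 503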

Local files via `file://`

A file:// URI is read from the server’s local filesystem. The path must be reachable by the server process and pass these guardrails (violation returns 415):

Must be an existing regular file (symlinks resolved)
Size ≤ EVEROS_MULTIMODAL__FILE_URI_MAX_BYTES (default 50 MiB)
If EVEROS_MULTIMODAL__FILE_URI_ALLOW_DIRS is set, the path must lie within one of the listed roots

{ "type": "pdf", "uri": "file:///srv/uploads/q3.pdf" }
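The three guardrails can be sketched as a single check. This is an illustration of the documented rules on a POSIX host, not the server's code: `check_file_uri` and its return strings are hypothetical, and the order of the checks is an assumption. It needs Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path
from typing import List, Optional
from urllib.parse import unquote, urlparse

MAX_BYTES = 50 * 1024 * 1024  # documented default, 50 MiB


def check_file_uri(uri: str, allow_dirs: List[str], max_bytes: int = MAX_BYTES) -> Optional[str]:
    """Return the guardrail a file:// URI violates, or None if it passes all three."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        return "not a file:// URI"
    path = Path(unquote(parsed.path)).resolve()  # resolve() follows symlinks
    if not path.is_file():
        return "not an existing regular file"
    if path.stat().st_size > max_bytes:
        return "file too large"
    if allow_dirs:  # an empty allowlist means any directory is accepted
        roots = [Path(d).resolve() for d in allow_dirs]
        if not any(path.is_relative_to(root) for root in roots):
            return "outside allowlisted directories"
    return None


# The result depends on what exists on the machine running this.
print(check_file_uri("file:///srv/uploads/q3.pdf", ["/srv/uploads"]))
```

Resolving symlinks before the allowlist comparison matters: it prevents a link inside an allowed root from pointing at a file outside it.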

Searching Multimodal Memory

Nothing special is required. Parsed text is folded into the same episodes and memory cells as text turns, so every retrieval method works across multimodal-derived memory unchanged:

curl -X POST http://127.0.0.1:8000/api/v1/memory/search \
  -H 'Content-Type: application/json' \
  -d '{
    "user_id": "alice",
    "query": "whiteboard from the planning session",
    "method": "hybrid"
  }'

See Retrieval for full details on keyword, vector, hybrid, and agentic methods.

Configuration Reference

All fields bind from environment variables (EVEROS_MULTIMODAL__<FIELD>) or the [multimodal] TOML section:

Env var	Default	Meaning
`EVEROS_MULTIMODAL__MODEL`	`google/gemini-3-flash-preview`	Parsing model; must accept `image_url` parts
`EVEROS_MULTIMODAL__API_KEY`	—	API key for the multimodal endpoint
`EVEROS_MULTIMODAL__BASE_URL`	`https://openrouter.ai/api/v1`	OpenAI-compatible base URL
`EVEROS_MULTIMODAL__MAX_CONCURRENCY`	`4`	Cap on parallel multimodal calls within one extraction
`EVEROS_MULTIMODAL__FILE_URI_MAX_BYTES`	`52428800` (50 MiB)	Max size of a `file://` asset
`EVEROS_MULTIMODAL__FILE_URI_ALLOW_DIRS`	`[]` (any)	JSON list of allowlisted base dirs for `file://` URIs
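The naming convention and value formats in the table can be illustrated with a small loader. This is a sketch only: EverOS's actual settings machinery is not described here, and `MultimodalSettings` and `load_settings` are hypothetical names. The defaults are copied from the table, and the allowlist is parsed as a JSON list.

```python
import json
import os
from dataclasses import dataclass, field, fields
from typing import List, Mapping, Optional

PREFIX = "EVEROS_MULTIMODAL__"


@dataclass
class MultimodalSettings:
    model: str = "google/gemini-3-flash-preview"
    api_key: str = ""
    base_url: str = "https://openrouter.ai/api/v1"
    max_concurrency: int = 4
    file_uri_max_bytes: int = 52428800  # 50 MiB
    file_uri_allow_dirs: List[str] = field(default_factory=list)  # empty = any


def load_settings(env: Optional[Mapping[str, str]] = None) -> MultimodalSettings:
    """Overlay EVEROS_MULTIMODAL__<FIELD> variables on the documented defaults."""
    env = os.environ if env is None else env
    settings = MultimodalSettings()
    for f in fields(settings):
        raw = env.get(PREFIX + f.name.upper())
        if raw is None:
            continue
        default = getattr(settings, f.name)
        if isinstance(default, list):
            setattr(settings, f.name, json.loads(raw))  # JSON list of directories
        elif isinstance(default, int):
            setattr(settings, f.name, int(raw))
        else:
            setattr(settings, f.name, raw)
    return settings


example = load_settings({
    "EVEROS_MULTIMODAL__MAX_CONCURRENCY": "8",
    "EVEROS_MULTIMODAL__FILE_URI_ALLOW_DIRS": '["/srv/uploads"]',
})
print(example)
```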

Error Handling

Failures fall into two classes, batch-aborting (415 / 503) and per-item (200):

Condition	Result
Non-text item with no `uri` or `base64`	`415`, batch aborted
Unknown extension / no handler for the modality	`415`, batch aborted
`base64` without a resolvable `ext` to dispatch on	`415`, batch aborted
`file://` fails a guardrail (missing / too large / outside allowlist)	`415`, batch aborted
Office document but no LibreOffice on host	`503` (`CAPABILITY_UNAVAILABLE`), batch aborted
Multimodal LLM call fails (timeout / rate-limit / model rejects asset)	`200`, that item is skipped (`parse_status="failed"`), rest of batch still extracts

A malformed-input or unsupported-format problem aborts the whole /add batch with 415; a missing capability such as LibreOffice aborts with 503 (CAPABILITY_UNAVAILABLE). A transient LLM failure degrades only the affected item: the request returns 200 and the remaining messages extract normally.
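On the client side, the two classes suggest different reactions. The sketch below maps the documented status codes to a next step; `classify_add_status` and its labels are hypothetical. Where `parse_status` appears in the response body is not specified here, so the sketch deliberately does not read the body.

```python
def classify_add_status(status_code: int) -> str:
    """Map an /add response status to a suggested client action."""
    if status_code == 200:
        # Batch accepted; individual items may still carry parse_status="failed".
        return "accepted"
    if status_code == 415:
        # Malformed input or unsupported format: resending unchanged will not help.
        return "fix_request"
    if status_code == 503:
        # Missing server capability such as LibreOffice: fix the host, then resend.
        return "fix_server"
    return "unexpected"


for code in (200, 415, 503):
    print(code, classify_add_status(code))
```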
