Skip to main content
EverOS turns non-text content — images, PDFs, audio, office documents, HTML, email — into the same structured, searchable memory as plain text. Attach the asset to a message at ingest time; a vision/audio-capable LLM parses it into text, and from there it flows through the identical extraction → markdown → index pipeline as any text turn. The result is fully retrievable with the same /search stack.

How It Works

POST /api/v1/memory/add
  messages[].content = [ ContentItem, ContentItem, ... ]

        ├─ text items      → used verbatim
        └─ non-text items  → multimodal LLM (everalgo-parser)

  parsed text merged back into the session buffer (in original order)

  boundary detector → extraction LLM → MemCell

  markdown (truth)  +  SQLite (state)  +  LanceDB (vector + BM25)

  retrievable via /search and /get like any text memory
Each non-text ContentItem is routed through the parser, which calls a separate, vision/audio-capable LLM configured independently from the main extraction [llm] — so parsing can target a multimodal endpoint without changing boundary or extraction behaviour. Visual/audio formats (image / pdf / audio / office) always go through that LLM; a few text-bearing formats (e.g. a plain email with no inline images) can be parsed without it. The parser returns text that takes the place of the asset in the message buffer — nothing downstream knows or cares that the content originated as an image or PDF. The raw bytes are not persisted past extraction — only the parsed text is stored.

Prerequisites

1

Install the multimodal extra

Multimodal parsing lives behind an optional dependency group so the base install stays lean:
pip install 'everos[multimodal]'
# or with uv:
uv pip install 'everos[multimodal]'
This pulls in everalgo-parser[svg] — the [svg] bundle adds cairosvg so SVG works out of the box.
2

Install LibreOffice (office documents only)

Office formats (.doc / .docx / .ppt / .pptx / .xls / .xlsx) are converted to PDF before being fed to the multimodal LLM. LibreOffice must be present on the server host:
brew install --cask libreoffice          # macOS
sudo apt-get install -y libreoffice      # Debian / Ubuntu
Without LibreOffice, office uploads return 415. Image, PDF, audio, HTML, and email parsing are unaffected.
3

Configure the multimodal LLM

The parser uses its own LLM section, independent from [llm]. The model must accept OpenAI image_url parts. everos init writes these into the generated .env:
EVEROS_MULTIMODAL__MODEL=google/gemini-3-flash-preview
EVEROS_MULTIMODAL__API_KEY=<your key>
EVEROS_MULTIMODAL__BASE_URL=https://openrouter.ai/api/v1
The default targets Gemini via OpenRouter so a single key covers both chat extraction and multimodal parsing.

Supported Content Types

typeTypical formatsPayloadNotes
texttextPlain text; a bare string content also maps here
imagePNG / JPG / GIF / WebP / SVGuri or base64SVG via bundled cairosvg
pdfPDFuri or base64
audioMP3 / WAV / …uri or base64Endpoint must accept audio parts
docDOC / DOCX / PPT / PPTX / XLS / XLSXuri or base64Requires LibreOffice
htmlHTMLuri or base64To send HTML as plain text instead, use type: "text"
emailEML / MSGuri or base64
A non-text item must carry either uri or base64 — a non-text item with only a text field returns 415.

Sending Multimodal Content

Switch the content field from a plain string to an array of typed ContentItem objects:
FieldPurpose
typeOne of the modalities above
textLiteral text — only for type: "text"
urihttp(s):// (fetched server-side) or file:// (read from server filesystem)
base64Inline payload, plain base64 (no data: prefix)
extExtension hint ("pdf", "png", …); effectively required for base64 payloads
nameDisplay filename for logs
uri (http(s)://)base64
Where the bytes liveFetched transiently at parse timeHeld verbatim in the SQLite session buffer until flush
Wire sizeURL only~4/3× the raw size (base64 inflation)
Best forLarge assets, S3/OSS presigned URLsSmall assets, or when no reachable URL exists
Prefer uri for anything large. A multi-MB base64 blob becomes multi-MB of SQLite buffer text for the buffer’s lifetime and slows request parsing. The bytes are never persisted past extraction either way — only the parsed text is.
TS=$(($(date +%s) * 1000))
curl -X POST http://127.0.0.1:8000/api/v1/memory/add \
  -H 'Content-Type: application/json' \
  -d "{
    \"session_id\": \"mm-001\",
    \"messages\": [
      {
        \"sender_id\": \"alice\",
        \"role\": \"user\",
        \"timestamp\": $TS,
        \"content\": [
          { \"type\": \"image\", \"uri\": \"https://example.com/whiteboard.png\" }
        ]
      }
    ]
  }"

Local files via file://

A file:// URI is read from the server’s local filesystem. The path must be reachable by the server process and pass these guardrails (violation returns 415):
  • Must be an existing regular file (symlinks resolved)
  • Size ≤ EVEROS_MULTIMODAL__FILE_URI_MAX_BYTES (default 50 MiB)
  • If EVEROS_MULTIMODAL__FILE_URI_ALLOW_DIRS is set, the path must lie within one of the listed roots
{ "type": "pdf", "uri": "file:///srv/uploads/q3.pdf" }

Searching Multimodal Memory

Nothing special is required. Parsed text is folded into the same episodes and memory cells as text turns, so every retrieval method works across multimodal-derived memory unchanged:
curl -X POST http://127.0.0.1:8000/api/v1/memory/search \
  -H 'Content-Type: application/json' \
  -d '{
    "user_id": "alice",
    "query": "whiteboard from the planning session",
    "method": "hybrid"
  }'
See Retrieval for full details on keyword, vector, hybrid, and agentic methods.

Configuration Reference

All fields bind from environment variables (EVEROS_MULTIMODAL__<FIELD>) or the [multimodal] TOML section:
Env varDefaultMeaning
EVEROS_MULTIMODAL__MODELgoogle/gemini-3-flash-previewParsing model; must accept image_url parts
EVEROS_MULTIMODAL__API_KEYAPI key for the multimodal endpoint
EVEROS_MULTIMODAL__BASE_URLhttps://openrouter.ai/api/v1OpenAI-compatible base URL
EVEROS_MULTIMODAL__MAX_CONCURRENCY4Cap on parallel multimodal calls within one extraction
EVEROS_MULTIMODAL__FILE_URI_MAX_BYTES52428800 (50 MiB)Max size of a file:// asset
EVEROS_MULTIMODAL__FILE_URI_ALLOW_DIRS[] (any)JSON list of allowlisted base dirs for file:// URIs

Error Handling

Two failure classes behave differently:
ConditionResult
Non-text item with no uri or base64415 — batch aborted
Unknown extension / no handler for the modality415 — batch aborted
base64 without a resolvable ext to dispatch on415 — batch aborted
Office document but no LibreOffice on host415 — batch aborted
file:// fails a guardrail (missing / too large / outside allowlist)415 — batch aborted
Multimodal LLM call fails (timeout / rate-limit / model rejects asset)200 — that item is skipped (parse_status="failed"), rest of batch still extracts
Deterministic problems abort the whole /add batch with 415. A transient LLM failure degrades only the affected item — the request returns 200 and the remaining messages extract normally.