/search stack.
How It Works
Each ContentItem is routed through the parser, which calls a separate vision/audio-capable LLM configured independently from the main extraction [llm], so parsing can target a multimodal endpoint without changing boundary or extraction behaviour. Visual/audio formats (image / pdf / audio / office) always go through that LLM; a few text-bearing formats (e.g. a plain email with no inline images) can be parsed without it. The parser returns text that takes the place of the asset in the message buffer, so nothing downstream knows or cares that the content originated as an image or PDF. The raw bytes are not persisted past extraction; only the parsed text is stored.
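The substitution described above can be pictured with a small sketch. It is only an illustration of the idea, not the server's implementation: `flatten_to_text` and `fake_parse` are hypothetical names, and `fake_parse` stands in for the real multimodal LLM call.

```python
from typing import Callable, Dict, List


def flatten_to_text(items: List[Dict], parse: Callable[[Dict], str]) -> List[str]:
    """Return one text string per content item.

    Text items pass through unchanged; every other item is replaced by
    whatever `parse` returns, so no raw payload appears in the result.
    """
    out = []
    for item in items:
        if item["type"] == "text":
            out.append(item["text"])
        else:
            out.append(parse(item))
    return out


def fake_parse(item: Dict) -> str:
    # Hypothetical stand-in for the multimodal LLM call.
    return f"[parsed {item['type']}: {item.get('name', 'unnamed')}]"


buffer = flatten_to_text(
    [
        {"type": "text", "text": "See the attached invoice."},
        {"type": "pdf", "uri": "https://example.com/invoice.pdf", "name": "invoice.pdf"},
    ],
    fake_parse,
)
```

After this step the buffer holds only strings, which is why downstream stages need no knowledge of the original modality.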
Prerequisites
Install the multimodal extra
Multimodal parsing lives behind an optional dependency group so the base install stays lean. The extra pulls in everalgo-parser[svg]; the [svg] bundle adds cairosvg so SVG works out of the box.
Install LibreOffice (office documents only)
Office formats (.doc / .docx / .ppt / .pptx / .xls / .xlsx) are converted to PDF before being fed to the multimodal LLM, so LibreOffice must be present on the server host. Without LibreOffice, office uploads return 415. Image, PDF, audio, HTML, and email parsing are unaffected.
Supported Content Types
| type | Typical formats | Payload | Notes |
|---|---|---|---|
| text | — | text | Plain text; a bare string content also maps here |
| image | PNG / JPG / GIF / WebP / SVG | uri or base64 | SVG via bundled cairosvg |
| pdf | PDF | uri or base64 | — |
| audio | MP3 / WAV / … | uri or base64 | Endpoint must accept audio parts |
| doc | DOC / DOCX / PPT / PPTX / XLS / XLSX | uri or base64 | Requires LibreOffice |
| html | HTML | uri or base64 | To send HTML as plain text instead, use type: "text" |
| email | EML / MSG | uri or base64 | — |
Every non-text type requires uri or base64; a non-text item with only a text field returns 415.
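A client can apply the table and the payload rule before sending anything. The sketch below is a hypothetical client-side helper, not part of the documented API: the extension map covers only the formats listed explicitly above (the audio list is open-ended in the table, so only MP3 and WAV appear here), and the function names are illustrative.

```python
from typing import Dict, Optional

# Extension -> ContentItem type, taken from the formats named in the table.
EXT_TO_TYPE = {
    "png": "image", "jpg": "image", "gif": "image", "webp": "image", "svg": "image",
    "pdf": "pdf",
    "mp3": "audio", "wav": "audio",
    "doc": "doc", "docx": "doc", "ppt": "doc", "pptx": "doc", "xls": "doc", "xlsx": "doc",
    "html": "html",
    "eml": "email", "msg": "email",
}


def content_type_for(filename: str) -> Optional[str]:
    """Guess the ContentItem type from a filename, or None if unknown."""
    if "." not in filename:
        return None
    ext = filename.rsplit(".", 1)[-1].lower()
    return EXT_TO_TYPE.get(ext)


def has_payload(item: Dict) -> bool:
    """Mirror the rule above: non-text items need uri or base64."""
    if item.get("type") == "text":
        return "text" in item
    return bool(item.get("uri") or item.get("base64"))
```

An item that fails `has_payload` is the kind the server is described as rejecting with 415, so catching it locally saves a round trip.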
Sending Multimodal Content
Switch the content field from a plain string to an array of typed ContentItem objects:
| Field | Purpose |
|---|---|
| type | One of the modalities above |
| text | Literal text — only for type: "text" |
| uri | http(s):// (fetched server-side) or file:// (read from server filesystem) |
| base64 | Inline payload, plain base64 (no data: prefix) |
| ext | Extension hint ("pdf", "png", …); effectively required for base64 payloads |
| name | Display filename for logs |
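As a rough illustration of the fields above, the following sketch assembles a mixed content array in Python. The field names come from the table; the `inline_item` helper, the example URL, and the surrounding `message` envelope are assumptions made for the example, since the full request schema is not shown here.

```python
import base64
import json
from typing import Dict, Optional


def inline_item(kind: str, data: bytes, ext: str, name: Optional[str] = None) -> Dict:
    """Build a ContentItem carrying its bytes inline as plain base64."""
    item = {
        "type": kind,
        # Standard base64 with no "data:" prefix, as the table requires.
        "base64": base64.b64encode(data).decode("ascii"),
        # ext is effectively required for base64 payloads.
        "ext": ext,
    }
    if name:
        item["name"] = name
    return item


content = [
    {"type": "text", "text": "Here is the whiteboard photo from today."},
    {"type": "image", "uri": "https://example.com/whiteboard.png", "name": "whiteboard.png"},
    inline_item("pdf", b"%PDF-1.4 ...", ext="pdf", name="notes.pdf"),
]

# Assumed envelope: only the content field is documented above.
message = {"content": content}
print(json.dumps(message, indent=2))
```

Mixing text, uri, and base64 items in one array keeps related material in a single message while letting each asset use whichever transport suits it.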
| | uri (http(s)://) | base64 |
|---|---|---|
| Where the bytes live | Fetched transiently at parse time | Held verbatim in the SQLite session buffer until flush |
| Wire size | URL only | ~4/3× the raw size (base64 inflation) |
| Best for | Large assets, S3/OSS presigned URLs | Small assets, or when no reachable URL exists |
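The trade-off above is easy to quantify. The sketch below computes the padded base64 length for a given raw size and picks a transport; `choose_payload` and its 256 KiB `inline_limit` are hypothetical, chosen only to illustrate the decision, not values taken from the product.

```python
def base64_wire_size(raw_bytes: int) -> int:
    """Length in characters of standard padded base64 for raw_bytes bytes."""
    # Every 3 input bytes (rounded up) become 4 output characters.
    return 4 * ((raw_bytes + 2) // 3)


def choose_payload(raw_bytes: int, has_reachable_url: bool,
                   inline_limit: int = 256 * 1024) -> str:
    """Pick "uri" for large assets with a reachable URL, else "base64".

    inline_limit is an illustrative threshold, not a documented setting.
    """
    if not has_reachable_url:
        return "base64"
    return "uri" if raw_bytes > inline_limit else "base64"
```

For example, a 3,000,000-byte file becomes 4,000,000 base64 characters on the wire, which is the 4/3 inflation listed in the table.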
Local files via file://
A file:// URI is read from the server’s local filesystem. The path must be reachable by the server process and pass these guardrails (violation returns 415):
- Must be an existing regular file (symlinks resolved)
- Size ≤ EVEROS_MULTIMODAL__FILE_URI_MAX_BYTES (default 50 MiB)
- If EVEROS_MULTIMODAL__FILE_URI_ALLOW_DIRS is set, the path must lie within one of the listed roots
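The three guardrails can be approximated with a pre-flight check run on the server host. This is a sketch under stated assumptions, not the server's own code: `check_file_uri` is a hypothetical helper, it assumes POSIX-style paths, and the real checks may differ in detail.

```python
from pathlib import Path
from typing import Optional, Sequence
from urllib.parse import unquote, urlparse

DEFAULT_MAX_BYTES = 50 * 1024 * 1024  # 50 MiB, the documented default


def check_file_uri(uri: str,
                   max_bytes: int = DEFAULT_MAX_BYTES,
                   allow_dirs: Sequence[str] = ()) -> Optional[str]:
    """Return None if the URI passes all guardrails, else a reason string."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        return "not a file:// URI"
    # resolve() follows symlinks, so the checks apply to the real target.
    path = Path(unquote(parsed.path)).resolve()
    if not path.is_file():
        return "not an existing regular file"
    if path.stat().st_size > max_bytes:
        return "file too large"
    if allow_dirs:
        roots = [Path(d).resolve() for d in allow_dirs]
        if not any(root in path.parents for root in roots):
            return "outside allowlisted directories"
    return None
```

Resolving symlinks before the allowlist comparison matters: a link placed inside an allowed root but pointing elsewhere is judged by where it actually leads.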
Searching Multimodal Memory
Nothing special is required. Parsed text is folded into the same episodes and memory cells as text turns, so every retrieval method (keyword, vector, hybrid, and agentic) works across multimodal-derived memory unchanged.
Configuration Reference
All fields bind from environment variables (EVEROS_MULTIMODAL__<FIELD>) or the [multimodal] TOML section:
| Env var | Default | Meaning |
|---|---|---|
| EVEROS_MULTIMODAL__MODEL | google/gemini-3-flash-preview | Parsing model; must accept image_url parts |
| EVEROS_MULTIMODAL__API_KEY | — | API key for the multimodal endpoint |
| EVEROS_MULTIMODAL__BASE_URL | https://openrouter.ai/api/v1 | OpenAI-compatible base URL |
| EVEROS_MULTIMODAL__MAX_CONCURRENCY | 4 | Cap on parallel multimodal calls within one extraction |
| EVEROS_MULTIMODAL__FILE_URI_MAX_BYTES | 52428800 (50 MiB) | Max size of a file:// asset |
| EVEROS_MULTIMODAL__FILE_URI_ALLOW_DIRS | [] (any) | JSON list of allowlisted base dirs for file:// URIs |
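To make the env-var binding concrete, here is a minimal sketch of how the table's variables and defaults could be read in Python. It is not the project's actual settings loader: `MultimodalSettings` and `load_settings` are hypothetical names, and the lowercase field names are an assumption derived from the `EVEROS_MULTIMODAL__<FIELD>` pattern.

```python
import json
import os
from dataclasses import dataclass, field
from typing import List, Mapping, Optional

PREFIX = "EVEROS_MULTIMODAL__"


@dataclass
class MultimodalSettings:
    # Defaults copied from the table above.
    model: str = "google/gemini-3-flash-preview"
    api_key: Optional[str] = None
    base_url: str = "https://openrouter.ai/api/v1"
    max_concurrency: int = 4
    file_uri_max_bytes: int = 52428800  # 50 MiB
    file_uri_allow_dirs: List[str] = field(default_factory=list)  # empty = any


def load_settings(env: Mapping[str, str] = os.environ) -> MultimodalSettings:
    """Overlay EVEROS_MULTIMODAL__* variables on the documented defaults."""
    settings = MultimodalSettings()
    for name in list(vars(settings)):
        raw = env.get(PREFIX + name.upper())
        if raw is None:
            continue
        current = getattr(settings, name)
        if isinstance(current, int):
            value = int(raw)
        elif isinstance(current, list):
            value = json.loads(raw)  # allow-dirs is given as a JSON list
        else:
            value = raw
        setattr(settings, name, value)
    return settings
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise without touching the real environment.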
Error Handling
Two failure classes behave differently:
| Condition | Result |
|---|---|
| Non-text item with no uri or base64 | 415 — batch aborted |
| Unknown extension / no handler for the modality | 415 — batch aborted |
| base64 without a resolvable ext to dispatch on | 415 — batch aborted |
| Office document but no LibreOffice on host | 415 — batch aborted |
| file:// fails a guardrail (missing / too large / outside allowlist) | 415 — batch aborted |
| Multimodal LLM call fails (timeout / rate-limit / model rejects asset) | 200 — that item is skipped (parse_status="failed"), rest of batch still extracts |
A validation failure aborts the whole /add batch with 415. A transient LLM failure degrades only the affected item: the request returns 200 and the remaining messages extract normally.
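On the client side, the two classes call for different reactions: fix and resend the whole batch on 415, or note the skipped items on 200. The sketch below illustrates that split. The response shape is an assumption: where `parse_status` appears in the real response is not shown here, so `items` is a hypothetical list of per-item dicts, and `summarize_add_result` is an illustrative name.

```python
from typing import Dict, List


def summarize_add_result(status_code: int, items: List[Dict]) -> Dict:
    """Classify an /add outcome into batch-level and item-level failures.

    `items` is a hypothetical per-item list carrying parse_status values.
    """
    if status_code == 415:
        # Validation failure: nothing in the batch was extracted.
        return {"accepted": False, "failed_items": []}
    if status_code == 200:
        failed = [
            index for index, item in enumerate(items)
            if item.get("parse_status") == "failed"
        ]
        # The batch was extracted; only the listed items were skipped.
        return {"accepted": True, "failed_items": failed}
    raise ValueError(f"unexpected status code: {status_code}")
```

A caller might resend only the items listed in `failed_items` later, since those failures are described as transient.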
