Embeddings pipeline
Module: packages/knowledge/src/embeddings/ Queue: embeddings (BullMQ). Default model: text-embedding-3-small (1536d, OpenAI). Privacy guard: OPENAI_ZDR_CONFIRMED=true required in production (runbook).
Lifecycle
[ MCP store_knowledge ]──INSERT─▶[ knowledge_entries (embedding=NULL) ]
│
└─enqueueEmbedding({ knowledgeId, orgId })──▶[ BullMQ embeddings queue ]
│
▼
[ worker: processEmbeddingJob ]
1. SELECT row by id
2. buildEmbeddingInput(title, tags, body)
3. provider.embed([input]) — OpenAI / mock / Ollama
4. UPDATE embedding + embedding_model
5. Redis counters (best-effort)Provider interface
provider.ts defines:
interface EmbeddingProvider {
name: string; // persisted on the row as embedding_model
dims: number; // must match knowledge_entries.embedding column type
embed(texts: string[]): Promise<number[][]>;
}Three implementations ship today:
| name | use | failure modes |
|---|---|---|
openAIEmbeddingProvider() | production (default) | rate-limit, 5xx, ZDR not on |
createMockEmbeddingProvider() | tests; deterministic SHA-256 → unit vector | failTimes, alwaysFail knobs |
| _(Ollama, future)_ | Pro tier "data stays in EU" | not implemented in Phase 0 |
Each row stores the provider's name in embedding_model, so we can grep for stale embeddings after a model migration (embedding_model='text-embedding-3-small' vs 'v2').
Queue defaults
| setting | value | reason |
|---|---|---|
attempts | 3 | OpenAI 429s are typically transient |
backoff | exponential, base 10 s | first retry at 10 s, then 20 s, then 40 s |
jobId | knowledgeId | idempotency — a repeat enqueue coalesces |
removeOnComplete | { count: 500 } | keep recent history for forensic |
removeOnFail | { age: 7 * 24 * 3600 } | 7 days of failures for triage |
concurrency (worker) | 4 | OpenAI rate-limit comfortable; tune later if needed |
Cost / volume metrics
After every successful job the worker increments:
| key | meaning |
|---|---|
embeddings:generated:total | Lifetime count of vectors written. |
embeddings:tokens:month:YYYY-MM | Approx tokens charged this calendar month (chars/4). |
Errors during the metric write are warning-level only — they never fail the job.
Input shape + truncation
buildEmbeddingInput({ title, body, tags }) returns:
${title}\n\n${tags.join(' ')}\n\n${body}Trimmed to MAX_INPUT_CHARS = 24 000 (well inside the 8 192-token OpenAI cap). Char-based cap is fine for English/Czech; if we ever store CJK-heavy content we'll swap to a real tokeniser.
Boot guard
assertOpenAIZdrConfirmed() throws in NODE_ENV=production unless OPENAI_ZDR_CONFIRMED=true. The MCP entry calls it before constructing the OpenAI provider. logOpenAIBootStatus() prints a one-line confirmation/rejection so ops can grep the boot log.
When OPENAI_EUDR_ENABLED=true, the OpenAI client points at https://eu.api.openai.com/v1 — available only for Enterprise OpenAI tier. Not enabled in Phase 0.
What this module does NOT do
- Search — task 20 layers pgvector cosine + tsvector hybrid on top.
- Re-embedding on model change — there's a runbook for it (Phase 9). Today, manual
job: bump EMBEDDING_MODEL, run a sweep script that re-enqueues every row.
- DLQ wiring — BullMQ already retains failures for 7 days; an explicit DLQ queue lands
with the verification engine (task 32) where retry semantics are richer.