Knowledge hybrid search
Module: packages/knowledge/src/search/
Index types involved: pgvector HNSW (cosine), tsvector GIN.
Pipeline
      [ caller: searchKnowledge(orgId, query, deps, opts) ]
                                │
                                ├─ embedQuery(query) (LRU cache)
                                │
                     ┌──────────┴──────────┐
                     ▼                     ▼
     [ vectorSearch (fanout) ]   [ keywordSearch (fanout) ]
                     │                     │
                     └──────────┬──────────┘
                                ▼
                          [ rrf(k=60) ]
                                │
                                ▼
                      [ top-K + snippets ]

- Fanout = limit × 3 from each leg, so RRF has room to fuse the tails.
- RRF ranks items by Σ 1 / (k + rank_in_list), with k = 60 from Cormack et al. The formula is parameter-free in practice; the only knob is k.
- The embedding cache is keyed by (query, provider.name); default 50 entries × 60 s TTL. We never cache result sets, because knowledge entries are superseded quickly.
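The embedding cache from the last bullet could look roughly like the sketch below. It is an illustration only: the class name `EmbeddingCache`, the `cacheKey` helper and the injectable clock are assumptions, not the module's real API. It relies on `Map` preserving insertion order to approximate LRU.

```typescript
// Hypothetical sketch of an LRU cache with per-entry TTL for query embeddings.
type CacheEntry = { value: number[]; expiresAt: number };

// Assumed key shape: provider name plus the raw query text.
function cacheKey(query: string, providerName: string): string {
  return JSON.stringify([providerName, query]);
}

class EmbeddingCache {
  // Map keeps insertion order, so the first key is the least recently used.
  private readonly entries = new Map<string, CacheEntry>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries = 50, ttlMs = 60_000, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): number[] | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // entry has expired
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, hit);
    return hit.value;
  }

  set(key: string, value: number[]): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry (the oldest insertion).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}
```

Only embeddings go through this cache; fused result sets are always recomputed.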
Vector search
SET LOCAL hnsw.ef_search = <efSearch>;
SELECT id, title, body, tags,
1 - (embedding <=> $vec) AS similarity
FROM knowledge_entries
WHERE org_id = $org
AND embedding IS NOT NULL
[ AND superseded_by_id IS NULL ] -- when includeSuperseded=false (default)
[ AND tags @> ARRAY[...]::text[] ] -- when caller passes tags
ORDER BY embedding <=> $vec
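LIMIT $fanout;
efSearch defaults to 64, the same value as pgvector's default build-time ef_construction (pgvector's own default for hnsw.ef_search is 40). Raise it to 200 for deep-dive queries via efSearch: 200. The hard cap is 200; anything higher trades cost for negligible recall. SET LOCAL only takes effect inside a transaction, and an HNSW scan yields at most ef_search candidates, so a fanout above efSearch can return fewer rows than requested.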
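A minimal sketch of how the statements for the vector leg might be assembled, assuming a client that uses `$1`-style placeholders (as node-postgres does). The function and type names here are hypothetical. The point it illustrates: `SET` cannot take bind parameters, so the value is clamped to an integer before being interpolated, and the statement is wrapped in a transaction so `SET LOCAL` applies.

```typescript
// Hypothetical builder for the vector leg; not the module's real API.
type VectorQueryOpts = {
  orgId: string;
  embedding: number[];
  fanout: number;
  efSearch?: number;
  includeSuperseded?: boolean;
  tags?: string[];
};

type Statement = { text: string; params: unknown[] };

function buildVectorStatements(opts: VectorQueryOpts): Statement[] {
  // Clamp to [1, 200] and force an integer before string interpolation.
  const requested = opts.efSearch ?? 64;
  const ef = Number.isFinite(requested)
    ? Math.min(200, Math.max(1, Math.trunc(requested)))
    : 64;

  // pgvector accepts a '[x,y,z]' text literal cast to vector.
  const vec = `[${opts.embedding.join(",")}]`;
  const params: unknown[] = [vec, opts.orgId];
  const where = ["org_id = $2", "embedding IS NOT NULL"];

  if (!opts.includeSuperseded) where.push("superseded_by_id IS NULL");
  if (opts.tags !== undefined && opts.tags.length > 0) {
    params.push(opts.tags);
    where.push(`tags @> $${params.length}::text[]`);
  }
  params.push(opts.fanout);

  const text = [
    "SELECT id, title, body, tags, 1 - (embedding <=> $1::vector) AS similarity",
    "FROM knowledge_entries",
    `WHERE ${where.join(" AND ")}`,
    "ORDER BY embedding <=> $1::vector",
    `LIMIT $${params.length}`,
  ].join("\n");

  // SET LOCAL is scoped to the enclosing transaction.
  return [
    { text: "BEGIN", params: [] },
    { text: `SET LOCAL hnsw.ef_search = ${ef}`, params: [] },
    { text, params },
    { text: "COMMIT", params: [] },
  ];
}
```

A caller would run the four statements in order on a single connection; error handling and rollback are left out of the sketch.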
Keyword search
SELECT id, title, body, tags,
ts_rank_cd(body_tsv, q) +
0.5 * (title ILIKE '%' || $q || '%')::int AS rank -- escape % and _ in $q for a literal match
FROM knowledge_entries, plainto_tsquery('simple', $q) AS q
WHERE org_id = $org
AND (body_tsv @@ q OR title ILIKE '%' || $q || '%')
[ AND superseded_by_id IS NULL ]
[ AND tags @> ARRAY[...]::text[] ]
ORDER BY rank DESC
LIMIT $fanout;
Title boost (+0.5 × title_match) compensates for short bodies where ts_rank_cd goes to zero.
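Because the title match uses ILIKE, any `%` or `_` in the user's query acts as a wildcard unless it is escaped. The helpers below are a hypothetical sketch (the names are not from the module) of building the pattern client-side and binding it as a single parameter, plus a mirror of the rank expression to show the effect of the boost.

```typescript
// Hypothetical helpers for the keyword leg; names are illustrative.

// Escape LIKE/ILIKE metacharacters. Backslash is PostgreSQL's default escape character.
function escapeLike(input: string): string {
  return input.replace(/[\\%_]/g, (ch) => "\\" + ch);
}

// Pattern to bind for a clause such as: title ILIKE $pattern
function titlePattern(query: string): string {
  return "%" + escapeLike(query) + "%";
}

// Mirrors the SQL expression: ts_rank_cd + 0.5 * title_match.
function keywordRank(tsRankCd: number, titleMatches: boolean): number {
  return tsRankCd + 0.5 * (titleMatches ? 1 : 0);
}
```

An entry whose body scores zero but whose title contains the query still gets a rank of 0.5, which is what keeps short entries in the keyword leg.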
Why hybrid
Vector alone loses on proper nouns, IDs, and versions, which are exactly the things our knowledge graph stores. Keyword alone misses paraphrases. RRF fuses the two lists using ranks only, so no calibrated score blend is needed.
We measured this in a sibling project (PulsMCP) — recall@5 goes up 15–25 % with hybrid vs vector-only on the same eval set.
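To make the rank-only fusion concrete, here is a minimal RRF sketch. The function name `rrfFuse` and the sample IDs are illustrative assumptions. Each list is assumed to hold unique IDs, best first. Raw similarity and ts_rank_cd values never enter the computation, only positions.

```typescript
// Minimal Reciprocal Rank Fusion sketch; rrfFuse is a hypothetical name.
type Fused = { id: string; score: number };

function rrfFuse(lists: string[][], k = 60): Fused[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first; ties broken by id for a stable order.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score || a.id.localeCompare(b.id),
  );
}

const vectorLeg = ["a", "b", "c"]; // ordered by cosine similarity
const keywordLeg = ["d", "b", "a"]; // ordered by keyword rank
const fused = rrfFuse([vectorLeg, keywordLeg]);
// Fused order: a, b, d, c. Items present in both legs outrank
// "d", even though "d" is first in the keyword leg.
```

Here "a" scores 1/61 + 1/63 and "b" scores 2/62, so both beat "d" at 1/61.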
Filters
| filter | default | behaviour |
|---|---|---|
| includeSuperseded | false | superseded_by_id IS NULL (only latest) |
| tags | [] | array-contains tags @> ARRAY[...] |
| limit | 10 | clamped to [1, 50] |
| efSearch | 64 | clamped to [1, 200] |
| minScore | none | drops fused entries below the threshold |
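The defaults and clamps in the table could be applied along these lines. This is a sketch under assumptions: `SearchOpts`, `resolveOpts` and `applyMinScore` are hypothetical names, and the handling of non-finite input (falling back to the default) is a choice made here, not something the table specifies.

```typescript
// Hypothetical option handling that mirrors the filter table.
type SearchOpts = {
  includeSuperseded?: boolean;
  tags?: string[];
  limit?: number;
  efSearch?: number;
  minScore?: number;
};

type ResolvedOpts = {
  includeSuperseded: boolean;
  tags: string[];
  limit: number;
  fanout: number;
  efSearch: number;
  minScore: number | undefined;
};

function clampInt(value: number | undefined, fallback: number, min: number, max: number): number {
  if (value === undefined || !Number.isFinite(value)) return fallback;
  return Math.min(max, Math.max(min, Math.trunc(value)));
}

function resolveOpts(opts: SearchOpts = {}): ResolvedOpts {
  const limit = clampInt(opts.limit, 10, 1, 50);
  return {
    includeSuperseded: opts.includeSuperseded ?? false,
    tags: opts.tags ?? [],
    limit,
    fanout: limit * 3, // each leg fetches three times the final limit
    efSearch: clampInt(opts.efSearch, 64, 1, 200),
    minScore: opts.minScore,
  };
}

// Drop fused entries below the threshold; no threshold means no filtering.
function applyMinScore<T extends { score: number }>(items: T[], minScore?: number): T[] {
  return minScore === undefined ? items : items.filter((item) => item.score >= minScore);
}
```

If minScore is compared against the raw RRF score (an assumption), thresholds live on a small scale: with k = 60 and two legs, the best possible fused score is 2/61 ≈ 0.0328.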
Snippets
buildSnippet(body, query) returns a 240-char window centred on the first matching token; ellipsis on either side when truncated. No HTML highlighting yet — task 36's UI adds <mark> later if needed.
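A rough sketch of the windowing described above, not the actual implementation. Assumptions made here: the query is tokenised on whitespace, matching is case-insensitive, "first matching token" means the earliest match position in the body, and a body with no match falls back to its opening characters.

```typescript
// Hypothetical re-implementation of the snippet window, for illustration only.
function buildSnippet(body: string, query: string, width = 240): string {
  if (body.length <= width) return body; // nothing to truncate

  const lowerBody = body.toLowerCase();
  const tokens = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);

  // Find the earliest position in the body where any query token occurs.
  let matchAt = -1;
  let matchLen = 0;
  for (const token of tokens) {
    const at = lowerBody.indexOf(token);
    if (at !== -1 && (matchAt === -1 || at < matchAt)) {
      matchAt = at;
      matchLen = token.length;
    }
  }

  // Centre the window on the match, then clamp it to the body bounds.
  const centre = matchAt === -1 ? width / 2 : matchAt + matchLen / 2;
  let start = Math.round(centre - width / 2);
  start = Math.max(0, Math.min(start, body.length - width));
  const end = start + width;

  const prefix = start > 0 ? "…" : "";
  const suffix = end < body.length ? "…" : "";
  return prefix + body.slice(start, end) + suffix;
}
```

The ellipsis characters are added outside the 240-character window, so a snippet truncated on both sides is 242 characters long in this sketch.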
What this module does NOT do
- LLM re-ranking (Phase 9 candidate — too expensive for v1).
- Personalised scoring per agent / per repo.
- Federated search across orgs (single-tenant by construction).
- Streaming results — small result sets, single round-trip is fine.