Knowledge hybrid search

Module: packages/knowledge/src/search/
Index types involved: pgvector HNSW (cosine), tsvector GIN.

Pipeline

[ caller: searchKnowledge(orgId, query, deps, opts) ]
                         │
                         ├─ embedQuery(query) (LRU cache)
                         │
              ┌──────────┴──────────┐
              ▼                     ▼
     [ vectorSearch (fanout)]  [ keywordSearch (fanout) ]
              │                     │
              └──────────┬──────────┘
                         ▼
                 [ rrf(k=60) ]
                         │
                         ▼
                 [ top-K + snippets ]

Fanout = limit × 3 from each leg, so RRF has room to fuse the tails.
RRF scores each item as Σ 1 / (k + rank_in_list), summed over the lists it appears in, with k = 60 from Cormack et al. The formula is effectively parameter-free; the only knob is k.
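
As an illustration of the fusion step, here is a minimal TypeScript sketch of RRF over lists of entry ids. The function name `rrf` mirrors the diagram, but the signature, the `Fused` type, and the tie-break by id are assumptions for this example, not the module's actual code. It also assumes each list contains an id at most once.

```typescript
// Reciprocal Rank Fusion: score(id) = sum over lists of 1 / (k + rank),
// where rank is the 1-based position of id in that list.
interface Fused {
  id: string;
  score: number;
}

function rrf(rankedLists: string[][], k = 60): Fused[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first; ties broken by id (an assumption, for determinism).
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

// "b" is only second in each leg but appears in both, so it is fused to the top.
const fused = rrf([
  ["a", "b", "c"], // vector leg
  ["d", "b", "e"], // keyword leg
]);
const topK = fused.slice(0, 2).map((f) => f.id);
```

Because only ranks enter the sum, the cosine similarity and the ts_rank_cd value never need to be put on a common scale.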

The embedding cache is keyed by (query, provider.name); the default is 50 entries with a 60 s TTL. We never cache result sets, because knowledge entries are superseded too quickly for cached results to stay fresh.
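
A cache with these properties could look roughly like the sketch below. The class name `EmbeddingCache`, its methods, and the injectable clock are hypothetical and chosen for illustration; only the key shape and the 50-entry / 60 s defaults come from the text above.

```typescript
// Hypothetical LRU + TTL cache for query embeddings (names are illustrative).
class EmbeddingCache {
  private readonly entries = new Map<string, { vector: number[]; expiresAt: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries = 50, ttlMs = 60_000, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // JSON-encode the pair so ("a", "bc") and ("ab", "c") cannot collide.
  private key(query: string, providerName: string): string {
    return JSON.stringify([query, providerName]);
  }

  get(query: string, providerName: string): number[] | undefined {
    const key = this.key(query, providerName);
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired
      return undefined;
    }
    // Re-insert to mark as most recently used (Map keeps insertion order).
    this.entries.delete(key);
    this.entries.set(key, hit);
    return hit.vector;
  }

  set(query: string, providerName: string, vector: number[]): void {
    const key = this.key(query, providerName);
    this.entries.delete(key);
    this.entries.set(key, { vector, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}
```

Including the provider name in the key keeps vectors from different embedding providers apart, since they are not comparable even for the same query text.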

Vector search

SET LOCAL hnsw.ef_search = <efSearch>;
SELECT id, title, body, tags,
       1 - (embedding <=> $vec) AS similarity
FROM knowledge_entries
WHERE org_id = $org
  AND embedding IS NOT NULL
  [ AND superseded_by_id IS NULL ]   -- when includeSuperseded=false (default)
  [ AND tags @> ARRAY[...]::text[] ] -- when caller passes tags
ORDER BY embedding <=> $vec
LIMIT $fanout;

ef_search defaults to 64 in this module (matching pgvector's default HNSW build-time ef_construction); pass efSearch: 200 to raise it for deep-dive queries. The hard cap is 200: anything higher costs more for negligible recall gain.
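
One practical detail: PostgreSQL's SET does not accept bind parameters, and SET LOCAL only takes effect inside a transaction block. The sketch below shows one way to build the statement safely by validating the value before interpolating it. The helper names `efSearchStatement` and `toVectorLiteral` are hypothetical; the `[x,y,z]` text form is pgvector's input format for a vector value.

```typescript
// Build the SET LOCAL statement. The value is interpolated because SET cannot
// take bind parameters, so it is validated as an integer in [1, 200] first.
function efSearchStatement(efSearch: number): string {
  if (!Number.isInteger(efSearch) || efSearch < 1 || efSearch > 200) {
    throw new RangeError(`efSearch must be an integer in [1, 200], got ${efSearch}`);
  }
  return `SET LOCAL hnsw.ef_search = ${efSearch}`;
}

// Serialise an embedding into pgvector's text form, e.g. "[0.1,2,-3]",
// suitable for binding as the $vec value with a ::vector cast.
function toVectorLiteral(embedding: number[]): string {
  if (embedding.length === 0 || !embedding.every((x) => Number.isFinite(x))) {
    throw new TypeError("embedding must be a non-empty array of finite numbers");
  }
  return `[${embedding.join(",")}]`;
}
```

Both statements would then run in the same transaction, so the ef_search setting applies to the SELECT and is discarded at commit or rollback.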

Keyword search

SELECT id, title, body, tags,
       ts_rank_cd(body_tsv, q) +
         0.5 * (title ILIKE '%' || $q || '%')::int AS rank  -- substring match on the raw query text
FROM knowledge_entries, plainto_tsquery('simple', $q) AS q
WHERE org_id = $org
  AND (body_tsv @@ q OR title ILIKE '%' || $q || '%')
  [ AND superseded_by_id IS NULL ]
  [ AND tags @> ARRAY[...]::text[] ]
ORDER BY rank DESC
LIMIT $fanout;

The title boost (+0.5 when the title contains the query) compensates for short or non-matching bodies, where ts_rank_cd goes to zero.
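
The bracketed clauses in both queries are optional and depend on the caller's options. A minimal sketch of how they might be assembled is shown below. `buildFilterClauses` and `FilterOpts` are hypothetical names, and the sketch assumes positional placeholders ($1, $2, ...) and a driver that can bind a string array as text[].

```typescript
interface FilterOpts {
  includeSuperseded?: boolean;
  tags?: string[];
}

// Returns the extra WHERE fragments plus their bind values. `firstParam` is the
// index of the first free placeholder (3 if $1 and $2 are already taken).
function buildFilterClauses(
  opts: FilterOpts,
  firstParam: number,
): { sql: string; params: unknown[] } {
  const clauses: string[] = [];
  const params: unknown[] = [];
  if (!opts.includeSuperseded) {
    // Default: only the latest version of each entry.
    clauses.push("AND superseded_by_id IS NULL");
  }
  if (opts.tags && opts.tags.length > 0) {
    params.push(opts.tags);
    clauses.push(`AND tags @> $${firstParam + params.length - 1}::text[]`);
  }
  return { sql: clauses.join("\n  "), params };
}
```

Binding the tag array as a parameter, rather than splicing values into ARRAY[...], keeps caller-supplied tags out of the SQL text.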

Why hybrid

Vector search alone loses on proper nouns, IDs, and versions, which are exactly the things our knowledge graph stores. Keyword search alone misses paraphrases. RRF fuses the two using ranks only, so no calibrated score blend is needed.

We measured this in a sibling project (PulsMCP): recall@5 went up 15–25 % with hybrid versus vector-only on the same eval set.

Filters

| filter | default | behaviour |
| --- | --- | --- |
| `includeSuperseded` | `false` | adds `superseded_by_id IS NULL` (latest entries only) |
| `tags` | `[]` | array-contains: `tags @> ARRAY[...]` |
| `limit` | `10` | clamped to [1, 50] |
| `efSearch` | `64` | clamped to [1, 200] |
| `minScore` | none | drops fused entries below the threshold |
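
The defaults and clamps listed above could be applied in one place before either leg runs. The following is a sketch only: `SearchOpts`, `resolveOpts`, `clampInt`, and `applyMinScore` are hypothetical names, and truncating fractional values is an assumption.

```typescript
interface SearchOpts {
  includeSuperseded?: boolean;
  tags?: string[];
  limit?: number;
  efSearch?: number;
  minScore?: number;
}

interface ResolvedOpts {
  includeSuperseded: boolean;
  tags: string[];
  limit: number;
  efSearch: number;
  minScore: number | undefined;
  fanout: number;
}

// Fall back to the default when unset, otherwise truncate and clamp.
function clampInt(value: number | undefined, fallback: number, min: number, max: number): number {
  if (value === undefined || !Number.isFinite(value)) return fallback;
  return Math.min(max, Math.max(min, Math.trunc(value)));
}

function resolveOpts(opts: SearchOpts = {}): ResolvedOpts {
  const limit = clampInt(opts.limit, 10, 1, 50);
  return {
    includeSuperseded: opts.includeSuperseded ?? false,
    tags: opts.tags ?? [],
    limit,
    efSearch: clampInt(opts.efSearch, 64, 1, 200),
    minScore: opts.minScore,
    fanout: limit * 3, // each leg over-fetches so RRF can fuse the tails
  };
}

// minScore drops fused entries whose score is below the threshold.
function applyMinScore<T extends { score: number }>(fused: T[], minScore?: number): T[] {
  return minScore === undefined ? fused : fused.filter((f) => f.score >= minScore);
}
```

Note that fused RRF scores are small by construction (an entry ranked first in both legs scores 2 / 61 with k = 60), so a useful minScore is on that scale rather than on the cosine-similarity scale.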

Snippets

buildSnippet(body, query) returns a 240-char window centred on the first matching token; ellipsis on either side when truncated. No HTML highlighting yet — task 36's UI adds <mark> later if needed.
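
For illustration, the windowing behaviour described above can be sketched as follows. This is not the module's implementation. It assumes whitespace tokenisation, case-insensitive substring matching, that "first matching token" means the earliest match position in the body, and a single "…" character as the ellipsis.

```typescript
const WINDOW = 240;

function buildSnippet(body: string, query: string): string {
  if (body.length <= WINDOW) return body;

  // Find the earliest case-insensitive occurrence of any query token.
  // Assumes lower-casing preserves string length (true for ASCII text).
  const haystack = body.toLowerCase();
  const tokens = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  let matchAt = -1;
  let matchLen = 0;
  for (const token of tokens) {
    const at = haystack.indexOf(token);
    if (at !== -1 && (matchAt === -1 || at < matchAt)) {
      matchAt = at;
      matchLen = token.length;
    }
  }

  // Centre the window on the match; with no match, start from the beginning.
  const centre = matchAt === -1 ? WINDOW / 2 : matchAt + matchLen / 2;
  let start = Math.max(0, Math.floor(centre - WINDOW / 2));
  start = Math.min(start, body.length - WINDOW); // keep the window inside the body
  const end = start + WINDOW;

  const prefix = start > 0 ? "…" : "";
  const suffix = end < body.length ? "…" : "";
  return prefix + body.slice(start, end) + suffix;
}
```

Clamping the start position means a match near either end of the body still yields a full 240-character window, with an ellipsis only on the truncated side.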

What this module does NOT do

LLM re-ranking (Phase 9 candidate — too expensive for v1).
Personalised scoring per agent / per repo.
Federated search across orgs (single-tenant by construction).
Streaming results — small result sets, single round-trip is fine.