Three refinements to retrieve_context, all keyed off observed failures from
test_retrieval.py:
- Library/personal split. classify_retrieval_intent now returns
(type_filter, folder_exclude_prefixes). Biographical document intent excludes
Library/* so philosophy/cognition books stop crowding out CVs and dossiers
for queries like "write me a bio".
- Near-duplicate collapse. Multi-folder copies of the same file (e.g., several
Teaching Philosophy.pdf in different application folders) used to fill the
top-N with the same content. Results are now deduplicated after rerank by a
hash of each result's first 300 characters.
- Folder in source citations. Surface metadata.folder alongside basename so
the LLM can disambiguate among 21 CV.docx variants and the user can see
which copy a citation refers to.
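For reference, the near-duplicate collapse can be sketched roughly as below.
This is a minimal illustration, not the code in this commit: the function
name collapse_near_duplicates, the "text" key, and the choice of SHA-1 are
assumptions made for the example. Only the 300-character prefix and the
"after rerank" ordering come from the description above.

```python
import hashlib


def collapse_near_duplicates(results, prefix_chars=300):
    """Keep the best-ranked result for each distinct text prefix.

    `results` is assumed to be ordered best-first (i.e. already reranked),
    and each item is assumed to be a dict with a "text" key.
    """
    seen = set()
    kept = []
    for result in results:
        # Hash only the first `prefix_chars` characters, so copies of the
        # same file stored in different folders map to the same digest.
        prefix = result["text"][:prefix_chars]
        digest = hashlib.sha1(prefix.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # a higher-ranked copy was already kept
        seen.add(digest)
        kept.append(result)
    return kept
```

Because the input is already in rerank order, the first copy seen is the
highest-scoring one, so the collapse never demotes a better result.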
Also: bump hnsw.ef_search to 500 when a WHERE filter is present.
pgvector 0.6 doesn't iterate past its initial HNSW candidate list, so a
restrictive filter that excludes all of the nearest candidates otherwise
returns no rows.
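A hedged sketch of the ef_search bump follows. The helper name
ef_search_statement is hypothetical, and using SET LOCAL (which scopes the
setting to the current transaction) is an assumption; the commit may issue
the setting differently. hnsw.ef_search itself is the pgvector setting
named above.

```python
def ef_search_statement(has_where_filter, filtered_ef=500):
    """Return the SQL to run before a filtered HNSW query, or None.

    With no WHERE filter the server default is left alone. With a filter,
    the candidate list is widened so that rows surviving the filter are
    more likely to be among the candidates the index returns.
    """
    if not has_where_filter:
        return None
    # SET LOCAL only takes effect inside a transaction block (assumed here).
    return f"SET LOCAL hnsw.ef_search = {int(filtered_ef)}"
```

The returned string would be executed on the same connection, in the same
transaction, immediately before the vector query.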
Replaces pure-dense top-8 retrieval with a three-stage pipeline:
- BM25 (tsvector + websearch_to_tsquery) and dense (pgvector) in parallel,
fused with Reciprocal Rank Fusion
- Optional type filter driven by classify_retrieval_intent() so questions
about prior conversations don't pull documents and vice versa
- Cross-encoder rerank (ms-marco-MiniLM-L-6-v2) over RRF candidates before
taking final top-N
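The fusion step can be illustrated with a small Reciprocal Rank Fusion
sketch. The function name and the constant k=60 are assumptions for the
example (60 is the value commonly used with RRF; the constant used in this
pipeline is not stated here). Each input list holds document ids ordered
best-first, one list per retriever.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first id lists into one best-first id list.

    Each document scores 1 / (k + rank) per list it appears in, with rank
    starting at 1, and the per-list scores are summed.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ids = ["a", "b", "c"]   # hypothetical BM25 ranking
dense_ids = ["b", "c", "d"]  # hypothetical dense ranking
fused = reciprocal_rank_fusion([bm25_ids, dense_ids])
```

Documents found by both retrievers ("b" and "c" here) rise above documents
found by only one, which is the property the rerank stage then builds on.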
Also adds scripts/reindex_docx_pptx.py — one-off re-ingest used to recover
table/header/text-box content in docx and pptx after the 93c0d89 extractor
upgrade — and scripts/test_retrieval.py to exercise the new pipeline against
representative queries.
Schema: requires a GIN index on to_tsvector('english', document). The index
was already created out-of-band via psql, since Apache AGE in
shared_preload_libraries blocks ALTER TABLE on this database.