api.py: folder-aware retrieval, near-duplicate dedup, folder in citations

Three refinements to retrieve_context, all keyed off observed failures from
test_retrieval.py:

- Library/personal split. classify_retrieval_intent now returns
  (type_filter, folder_exclude_prefixes). Biographical document intent excludes
  Library/* so philosophy/cognition books stop crowding out CVs and dossiers
  for queries like "write me a bio".

- Near-duplicate collapse. Multi-folder copies of the same file (e.g., several
  copies of Teaching Philosophy.pdf in different application folders) used to
  fill the top-N with the same content. Results are now deduplicated after
  rerank by a hash of each chunk's first 300 characters.

- Folder in source citations. Surface metadata.folder alongside basename so
  the LLM can disambiguate among 21 CV.docx variants and the user can see
  which copy a citation refers to.
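
A minimal sketch of how the three refinements above could fit together. It is
not the code in api.py: the Chunk type, the helper names, the choice of
SHA-256, and doing the folder exclusion in Python rather than in SQL are all
assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class Chunk:
    # Hypothetical stand-in for one reranked retrieval hit.
    text: str
    path: str    # e.g. "Applications/A/CV.docx"
    folder: str  # metadata.folder, e.g. "Applications/A"


def drop_excluded_folders(chunks, folder_exclude_prefixes):
    """Remove hits whose folder starts with any excluded prefix."""
    prefixes = tuple(folder_exclude_prefixes)
    return [c for c in chunks if not c.folder.startswith(prefixes)]


def collapse_near_duplicates(chunks, prefix_chars=300):
    """Keep the first (best-ranked) hit for each first-N-chars hash."""
    seen = set()
    kept = []
    for c in chunks:
        key = hashlib.sha256(c.text[:prefix_chars].encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(c)
    return kept


def citation_label(chunk):
    """Render 'basename (folder)' so same-named files stay distinguishable."""
    name = PurePosixPath(chunk.path).name
    return f"{name} ({chunk.folder})" if chunk.folder else name


# Hits are assumed to arrive already sorted by rerank score, best first.
ranked = [
    Chunk("I teach by ...", "Applications/A/Teaching Philosophy.pdf", "Applications/A"),
    Chunk("I teach by ...", "Applications/B/Teaching Philosophy.pdf", "Applications/B"),
    Chunk("Consciousness is ...", "Library/Philosophy/Mind.pdf", "Library/Philosophy"),
    Chunk("Curriculum vitae ...", "Applications/A/CV.docx", "Applications/A"),
]
kept = collapse_near_duplicates(drop_excluded_folders(ranked, ["Library/"]))
labels = [citation_label(c) for c in kept]
```

Because the dedup pass runs on reranked results, the surviving copy of each
duplicate is the best-ranked one. Hashing only a prefix is cheap, but it will
not catch copies that differ within their first 300 characters.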

Also: bump hnsw.ef_search to 500 when a WHERE filter is present.
pgvector 0.6 doesn't iterate past its initial HNSW candidate list, so a
restrictive filter that excludes the nearest neighbors can otherwise return
too few rows, or none.
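
A hedged sketch of the ef_search bump, written as a statement builder so it
runs without a database. hnsw.ef_search and the <=> cosine-distance operator
are pgvector's; the chunks table, its columns, the parameter names, and the
function itself are hypothetical and not taken from api.py.

```python
def build_vector_query(where_sql=None, limit=20, filtered_ef_search=500):
    """Return the SQL statements to run, in order, inside one transaction."""
    statements = []
    if where_sql:
        # SET LOCAL limits the larger candidate list to this transaction,
        # so unfiltered queries keep the default ef_search.
        statements.append(f"SET LOCAL hnsw.ef_search = {int(filtered_ef_search)}")
    where_clause = f"WHERE {where_sql} " if where_sql else ""
    statements.append(
        "SELECT id, content FROM chunks "
        f"{where_clause}"
        "ORDER BY embedding <=> %(query_vec)s "
        f"LIMIT {int(limit)}"
    )
    return statements


unfiltered = build_vector_query()
filtered = build_vector_query("doc_type = %(doc_type)s", limit=10)
```

The filter is applied to the candidates the index scan returns, so a larger
ef_search gives it more rows to choose from at the cost of a slower scan.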
2026-05-19 21:35:28 +00:00
parent 8d560f9f5e
commit 50b97e2998
2 changed files with 83 additions and 33 deletions
+5 -3
@@ -50,9 +50,11 @@ QUERIES = [
 ]
 for q in QUERIES:
-    intent = classify_retrieval_intent(q)
-    pieces, sources = retrieve_context(q, type_filter=intent)
+    type_filter, folder_excludes = classify_retrieval_intent(q)
+    pieces, sources = retrieve_context(
+        q, type_filter=type_filter, folder_exclude_prefixes=folder_excludes,
+    )
     print(f"\n=== {q!r} ===")
-    print(f" intent: {intent}")
+    print(f" type_filter: {type_filter} folder_excludes: {folder_excludes}")
     for i, src in enumerate(sources, 1):
         print(f" {i}. {src}")