api.py: folder-aware retrieval, near-duplicate dedup, folder in citations

Three refinements to retrieve_context, all keyed off observed failures from test_retrieval.py:

- Library/personal split. classify_retrieval_intent now returns (type_filter, folder_exclude_prefixes). Biographical document intent excludes Library/* so philosophy/cognition books stop crowding out CVs and dossiers for queries like "write me a bio".
- Near-duplicate collapse. Multi-folder copies of the same file (e.g., several Teaching Philosophy.pdf in different application folders) used to fill the top-N with the same content. Dedup by first-300-chars hash after rerank.
- Folder in source citations. Surface metadata.folder alongside basename so the LLM can disambiguate among 21 CV.docx variants and the user can see which copy a citation refers to.

Also: bump hnsw.ef_search to 500 when a WHERE filter is present. pgvector 0.6 doesn't iterate past its initial HNSW candidate list, so a restrictive filter that excludes the nearest neighbors otherwise returns empty.
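A minimal sketch of the near-duplicate collapse described above, assuming reranked results arrive as a list of dicts with a "text" key. The function name, the dict shape, and the choice of SHA-1 are illustrative assumptions, not taken from api.py; the commit only states that dedup uses a hash of the first 300 characters and runs after rerank.

```python
import hashlib


def dedupe_by_prefix_hash(chunks, prefix_chars=300):
    """Drop chunks whose first `prefix_chars` characters match an earlier chunk.

    `chunks` is assumed to be in rerank order (best first), so the
    highest-ranked copy of each near-duplicate is the one that survives.
    """
    seen = set()
    unique = []
    for chunk in chunks:
        # Hash only the leading text; copies of the same file in different
        # folders share this prefix even if metadata differs.
        prefix = chunk["text"][:prefix_chars]
        digest = hashlib.sha1(prefix.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        unique.append(chunk)
    return unique


# Hypothetical reranked results: two copies of one file plus a distinct chunk.
reranked = [
    {"text": "A" * 400 + " tail one", "folder": "Applications/One"},
    {"text": "A" * 400 + " tail two", "folder": "Applications/Two"},
    {"text": "A different document", "folder": "Applications/One"},
]
deduped = dedupe_by_prefix_hash(reranked)
```

Running the dedup after rerank rather than before means the surviving copy is the one the reranker scored highest, which is presumably why the commit orders the steps that way.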
2026-05-19 21:35:28 +00:00
parent 8d560f9f5e
commit 50b97e2998
2 changed files with 83 additions and 33 deletions
@@ -50,9 +50,11 @@ QUERIES = [
 ]

 for q in QUERIES:
-    intent = classify_retrieval_intent(q)
-    pieces, sources = retrieve_context(q, type_filter=intent)
+    type_filter, folder_excludes = classify_retrieval_intent(q)
+    pieces, sources = retrieve_context(
+        q, type_filter=type_filter, folder_exclude_prefixes=folder_excludes,
+    )
     print(f"\n=== {q!r} ===")
-    print(f"  intent: {intent}")
+    print(f"  type_filter: {type_filter}  folder_excludes: {folder_excludes}")
     for i, src in enumerate(sources, 1):
         print(f"  {i}. {src}")
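The hunk above only shows the call site, so here is a hedged sketch of what a classifier returning (type_filter, folder_exclude_prefixes) and a matching prefix check might look like. Everything in it is hypothetical: the names classify_intent_sketch and folder_is_excluded, the keyword list, and the "biographical" filter value are assumptions for illustration and do not come from api.py.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical trigger words; the real classifier's rules are not shown in the diff.
BIO_KEYWORDS = ("bio", "cv", "resume")


def classify_intent_sketch(query: str) -> Tuple[Optional[str], Tuple[str, ...]]:
    """Return (type_filter, folder_exclude_prefixes) for a query."""
    words = [w.strip(".,!?\"'") for w in query.lower().split()]
    if any(w in BIO_KEYWORDS for w in words):
        # Biographical intent: keep Library/* books out of the candidates.
        return "biographical", ("Library/",)
    # No special intent: no type filter and nothing excluded.
    return None, ()


def folder_is_excluded(folder: str, exclude_prefixes: Sequence[str]) -> bool:
    """True when `folder` starts with any of the excluded prefixes."""
    return any(folder.startswith(prefix) for prefix in exclude_prefixes)


type_filter, folder_excludes = classify_intent_sketch("write me a bio")
```

Returning the exclusions as prefixes rather than exact folder names lets one entry such as "Library/" cover every subfolder beneath it.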