api.py: folder-aware retrieval, near-duplicate dedup, folder in citations
Three refinements to retrieve_context, all keyed off observed failures
from test_retrieval.py:

- Library/personal split. classify_retrieval_intent now returns
  (type_filter, folder_exclude_prefixes). Biographical document intent
  excludes Library/* so philosophy/cognition books stop crowding out CVs
  and dossiers for queries like "write me a bio".
- Near-duplicate collapse. Multi-folder copies of the same file (e.g.,
  several Teaching Philosophy.pdf in different application folders) used
  to fill the top-N with the same content. Dedup by first-300-chars hash
  after rerank.
- Folder in source citations. Surface metadata.folder alongside basename
  so the LLM can disambiguate among 21 CV.docx variants and the user can
  see which copy a citation refers to.

Also: bump hnsw.ef_search to 500 when a WHERE filter is present.
pgvector 0.6 doesn't iterate past its initial HNSW candidate list, so a
restrictive filter that excludes the nearest neighbors otherwise returns
empty.
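
The near-duplicate collapse described above could look roughly like the sketch below. This is not the code from api.py: the function name dedup_by_prefix_hash, the dict shape of a piece ("text" and "folder" keys), and the choice of SHA-1 are all assumptions made for illustration. Only the idea of hashing the first 300 characters after rerank comes from the commit message.

```python
import hashlib


def dedup_by_prefix_hash(pieces, prefix_len=300):
    """Keep the first piece seen for each distinct text prefix.

    `pieces` is assumed to be a list of dicts with a "text" key, already
    sorted best-first by the reranker (hypothetical shape).
    """
    seen = set()
    kept = []
    for piece in pieces:
        # Hash only the first prefix_len characters, so copies of the same
        # file stored in different folders collapse to a single key.
        prefix = piece["text"][:prefix_len]
        key = hashlib.sha1(prefix.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # a higher-ranked copy was already kept
        seen.add(key)
        kept.append(piece)
    return kept
```

Because the input is already in rerank order, the surviving copy of each duplicate group is the highest-ranked one. Two distinct documents that happen to share their first 300 characters would also be collapsed, which is the trade-off of a prefix hash.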
@@ -50,9 +50,11 @@ QUERIES = [
 ]
 
 for q in QUERIES:
-    intent = classify_retrieval_intent(q)
-    pieces, sources = retrieve_context(q, type_filter=intent)
+    type_filter, folder_excludes = classify_retrieval_intent(q)
+    pieces, sources = retrieve_context(
+        q, type_filter=type_filter, folder_exclude_prefixes=folder_excludes,
+    )
     print(f"\n=== {q!r} ===")
-    print(f" intent: {intent}")
+    print(f" type_filter: {type_filter} folder_excludes: {folder_excludes}")
     for i, src in enumerate(sources, 1):
         print(f" {i}. {src}")
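
The hunk above only shows the caller passing type_filter and folder_exclude_prefixes into retrieve_context. As a rough sketch of how those arguments might become a filtered pgvector query, with hnsw.ef_search raised only when a WHERE clause is present: the helper name build_retrieval_sql, the table name chunks, the columns embedding and metadata, the 'type' and 'folder' JSON keys, and type_filter being a single string are all assumptions, not taken from api.py.

```python
def build_retrieval_sql(type_filter=None, folder_exclude_prefixes=()):
    """Return (setup_statements, query, filter_params) for a vector search.

    The caller is expected to append the query embedding and the row limit
    to filter_params before executing `query`.
    """
    clauses = []
    params = []
    if type_filter is not None:
        clauses.append("metadata->>'type' = %s")
        params.append(type_filter)
    for prefix in folder_exclude_prefixes:
        # COALESCE keeps rows with no folder; otherwise NOT NULL would drop them.
        clauses.append("NOT starts_with(COALESCE(metadata->>'folder', ''), %s)")
        params.append(prefix)

    setup = []
    if clauses:
        # Widen the HNSW candidate list only for filtered searches, so rows
        # that survive the filter are still reachable. 500 is the value
        # named in the commit message.
        setup.append("SET LOCAL hnsw.ef_search = 500")

    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    query = (
        "SELECT id, metadata FROM chunks"
        + where
        + " ORDER BY embedding <=> %s::vector LIMIT %s"
    )
    return setup, query, params
```

SET LOCAL only takes effect inside a transaction, so the setup statement and the query would need to run in the same one. Unfiltered searches keep the default ef_search and avoid the extra scan cost.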