Files
aaronAI/scripts
aaron 50b97e2998 api.py: folder-aware retrieval, near-duplicate dedup, folder in citations
Three refinements to retrieve_context, all keyed off observed failures from
test_retrieval.py:

- Library/personal split. classify_retrieval_intent now returns
  (type_filter, folder_exclude_prefixes). Biographical document intent excludes
  Library/* so philosophy/cognition books stop crowding out CVs and dossiers
  for queries like "write me a bio".

- Near-duplicate collapse. Multi-folder copies of the same file (e.g., several
  Teaching Philosophy.pdf in different application folders) used to fill the
  top-N with the same content. Dedup by first-300-chars hash after rerank.

- Folder in source citations. Surface metadata.folder alongside basename so
  the LLM can disambiguate among 21 CV.docx variants and the user can see
  which copy a citation refers to.

Also: bump hnsw.ef_search to 500 when a WHERE filter is present.
pgvector 0.6 doesn't iterate past its initial HNSW candidate list, so a
restrictive filter that excludes the nearest neighbors otherwise returns
empty.
2026-05-19 21:35:28 +00:00
..