aaronAI

Author SHA1 Message Date

Author	SHA1	Message	Date
aaron	9e86297e2a	api.py: tool-call retrieval, drop the keyword intent classifier Removes classify_retrieval_intent and the type/folder filter parameters on retrieve_context. The keyword classifier was the same anti-pattern as the formatting-driven docx chunker: a heuristic that locks the user into specific phrasings and fails silently on anything novel. A scope enum (personal / library / conversations / memory) would have been the same heuristic in a fancier wrapper — the categories themselves are mine, not Aaron's. New shape: a retrieve_documents tool exposed to Claude. Tool takes a single query argument; the model decides when to call it, what to search for, and how many times per turn (multi-query falls out naturally for compound asks). Pre-LLM retrieval is gone — memory still rides as ground truth in the prompt, but corpus content is fetched on demand by the model with concrete queries it crafts itself, not the user's raw phrasing. retrieve_context is now pure: hybrid retrieval + cross-encoder rerank + dedup, no filters. The reranker ranks, the model judges relevance. When ranking fails (e.g. abstract instructional queries pulling philosophy books), the right fix is a better reranker, not another query-time taxonomy. That work is acknowledged but deferred. System prompt updated to teach the model about the tool and to prefer concrete tokens (named entities, project names, course codes) over abstract phrasing when constructing search queries.	2026-05-19 23:05:25 +00:00
aaron	50b97e2998	api.py: folder-aware retrieval, near-duplicate dedup, folder in citations Three refinements to retrieve_context, all keyed off observed failures from test_retrieval.py: - Library/personal split. classify_retrieval_intent now returns (type_filter, folder_exclude_prefixes). Biographical document intent excludes Library/* so philosophy/cognition books stop crowding out CVs and dossiers for queries like "write me a bio". - Near-duplicate collapse. Multi-folder copies of the same file (e.g., several Teaching Philosophy.pdf in different application folders) used to fill the top-N with the same content. Dedup by first-300-chars hash after rerank. - Folder in source citations. Surface metadata.folder alongside basename so the LLM can disambiguate among 21 CV.docx variants and the user can see which copy a citation refers to. Also: bump hnsw.ef_search to 500 when a WHERE filter is present. pgvector 0.6 doesn't iterate past its initial HNSW candidate list, so a restrictive filter that excludes the nearest neighbors otherwise returns empty.	2026-05-19 21:35:28 +00:00
aaron	8d560f9f5e	api.py: hybrid retrieval with intent routing and cross-encoder rerank Replaces pure-dense top-8 retrieval with a three-stage pipeline: - BM25 (tsvector + websearch_to_tsquery) and dense (pgvector) in parallel, fused with Reciprocal Rank Fusion - Optional type filter driven by classify_retrieval_intent() so questions about prior conversations don't pull documents and vice versa - Cross-encoder rerank (ms-marco-MiniLM-L-6-v2) over RRF candidates before taking final top-N Also adds scripts/reindex_docx_pptx.py — one-off re-ingest used to recover table/header/text-box content in docx and pptx after the `93c0d89` extractor upgrade — and scripts/test_retrieval.py to exercise the new pipeline against representative queries. Schema: requires GIN index on to_tsvector('english', document) (already created out-of-band via psql since Apache AGE in shared_preload_libraries blocks ALTER TABLE on this database).	2026-05-19 21:11:15 +00:00

9e86297e2a

api.py: tool-call retrieval, drop the keyword intent classifier

Removes classify_retrieval_intent and the type/folder filter parameters on
retrieve_context. The keyword classifier was the same anti-pattern as the
formatting-driven docx chunker: a heuristic that locks the user into specific
phrasings and fails silently on anything novel. A scope enum (personal /
library / conversations / memory) would have been the same heuristic in a
fancier wrapper — the categories themselves are mine, not Aaron's.

New shape: a retrieve_documents tool exposed to Claude. Tool takes a single
query argument; the model decides when to call it, what to search for, and
how many times per turn (multi-query falls out naturally for compound asks).
Pre-LLM retrieval is gone — memory still rides as ground truth in the prompt,
but corpus content is fetched on demand by the model with concrete queries
it crafts itself, not the user's raw phrasing.

retrieve_context is now pure: hybrid retrieval + cross-encoder rerank + dedup,
no filters. The reranker ranks, the model judges relevance. When ranking
fails (e.g. abstract instructional queries pulling philosophy books), the
right fix is a better reranker, not another query-time taxonomy. That work
is acknowledged but deferred.

System prompt updated to teach the model about the tool and to prefer
concrete tokens (named entities, project names, course codes) over abstract
phrasing when constructing search queries.

2026-05-19 23:05:25 +00:00

aaron

50b97e2998

api.py: folder-aware retrieval, near-duplicate dedup, folder in citations

Three refinements to retrieve_context, all keyed off observed failures from
test_retrieval.py:

- Library/personal split. classify_retrieval_intent now returns
  (type_filter, folder_exclude_prefixes). Biographical document intent excludes
  Library/* so philosophy/cognition books stop crowding out CVs and dossiers
  for queries like "write me a bio".

- Near-duplicate collapse. Multi-folder copies of the same file (e.g., several
  Teaching Philosophy.pdf in different application folders) used to fill the
  top-N with the same content. Dedup by first-300-chars hash after rerank.

- Folder in source citations. Surface metadata.folder alongside basename so
  the LLM can disambiguate among 21 CV.docx variants and the user can see
  which copy a citation refers to.

Also: bump hnsw.ef_search to 500 when a WHERE filter is present.
pgvector 0.6 doesn't iterate past its initial HNSW candidate list, so a
restrictive filter that excludes the nearest neighbors otherwise returns
empty.

2026-05-19 21:35:28 +00:00

aaron

8d560f9f5e

api.py: hybrid retrieval with intent routing and cross-encoder rerank

Replaces pure-dense top-8 retrieval with a three-stage pipeline:
- BM25 (tsvector + websearch_to_tsquery) and dense (pgvector) in parallel,
  fused with Reciprocal Rank Fusion
- Optional type filter driven by classify_retrieval_intent() so questions
  about prior conversations don't pull documents and vice versa
- Cross-encoder rerank (ms-marco-MiniLM-L-6-v2) over RRF candidates before
  taking final top-N

Also adds scripts/reindex_docx_pptx.py — one-off re-ingest used to recover
table/header/text-box content in docx and pptx after the 93c0d89 extractor
upgrade — and scripts/test_retrieval.py to exercise the new pipeline against
representative queries.

Schema: requires GIN index on to_tsvector('english', document) (already
created out-of-band via psql since Apache AGE in shared_preload_libraries
blocks ALTER TABLE on this database).

2026-05-19 21:11:15 +00:00

3 Commits