8d560f9f5e
Replaces pure-dense top-8 retrieval with a three-stage pipeline:

- BM25 (tsvector + websearch_to_tsquery) and dense (pgvector) in parallel,
  fused with Reciprocal Rank Fusion
- Optional type filter driven by classify_retrieval_intent() so questions
  about prior conversations don't pull documents and vice versa
- Cross-encoder rerank (ms-marco-MiniLM-L-6-v2) over RRF candidates before
  taking final top-N
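For readers unfamiliar with the fusion step, here is a minimal sketch of Reciprocal Rank Fusion over two ranked ID lists. It is not the code from this commit: the function name, the sample IDs, and the constant k=60 (a commonly used default) are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]],
    k: int = 60,  # assumed smoothing constant; the commit does not state its value
    top_n: Optional[int] = None,
) -> List[Hashable]:
    """Fuse ranked ID lists using score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the best hit in each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=lambda d: scores[d], reverse=True)
    return fused if top_n is None else fused[:top_n]


# Hypothetical result IDs from the lexical and dense stages.
bm25_ids = ["a", "b", "c"]
dense_ids = ["c", "a", "d"]
fused = reciprocal_rank_fusion([bm25_ids, dense_ids])
print(fused)
```

Because RRF only looks at ranks, the BM25 and cosine-distance scores never need to be normalized against each other; an ID that appears near the top of both lists ends up ahead of one that appears in only one.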

Also adds scripts/reindex_docx_pptx.py, a one-off re-ingest used to recover
table/header/text-box content in docx and pptx after the 93c0d89 extractor
upgrade, and scripts/test_retrieval.py, which exercises the new pipeline
against representative queries.

Schema: requires a GIN index on to_tsvector('english', document). The index
was already created out-of-band via psql, since Apache AGE in
shared_preload_libraries blocks ALTER TABLE on this database.
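As a rough illustration of the schema note, the sketch below holds the kind of SQL involved as Python strings. Only the `document` column, the 'english' configuration, and websearch_to_tsquery come from the message above; the table name `chunks`, the `id` column, the index name, and the use of ts_rank for scoring are assumptions, and the commit's actual statements may differ.

```python
from typing import Tuple

TABLE = "chunks"  # hypothetical table name, not taken from the commit

# Expression index of the kind the schema note describes (run once, e.g. via psql).
CREATE_INDEX_SQL = f"""
CREATE INDEX IF NOT EXISTS {TABLE}_document_fts_idx
    ON {TABLE}
    USING GIN (to_tsvector('english', document));
"""


def lexical_query(user_query: str, limit: int = 50) -> Tuple[str, tuple]:
    """Build a parameterized full-text query and its parameters.

    The WHERE clause repeats the indexed expression exactly
    (to_tsvector('english', document)) so the planner can use the GIN index.
    ts_rank is a stand-in here; the commit's real scoring is not shown.
    """
    sql = f"""
    SELECT id,
           ts_rank(to_tsvector('english', document),
                   websearch_to_tsquery('english', %s)) AS score
    FROM {TABLE}
    WHERE to_tsvector('english', document) @@ websearch_to_tsquery('english', %s)
    ORDER BY score DESC
    LIMIT %s
    """
    return sql, (user_query, user_query, limit)


sql, params = lexical_query("hybrid search", limit=20)
```

The returned pair can be passed to any DB-API cursor that uses `%s` placeholders, for example `cursor.execute(sql, params)`.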
51 lines
677 B
Plaintext
# Backup files (rely on git history instead)
*.bak
*.bak.*

# Runtime artifacts
watcher_heartbeat
dreamer_state.json
corpus_integrity_report.json
watcher_state.json
watcher_status.json
reindex_status.json

# Logs (these belong in /var/log/)
*.log

# Python artifacts
__pycache__/
*.pyc
*.pyo
*.pyd
.pytest_cache/
*.egg-info/

# Virtual environment
venv/
.venv/

# Environment and secrets
.env
.env.local
.env.*.local

# Editor and OS cruft
.vscode/
.idea/
*.swp
*.swo
.DS_Store
Thumbs.db

# Local data not for repo
db/
embeddings/
experiments/summary_embeddings_cache.json

# Aaron AI runtime data (personal, do not commit)
conversations.db
sessions.db
memory.md
settings.json