aaronAI/.gitignore
aaron 8d560f9f5e api.py: hybrid retrieval with intent routing and cross-encoder rerank
Replaces pure-dense top-8 retrieval with a three-stage pipeline:
- BM25 (tsvector + websearch_to_tsquery) and dense (pgvector) in parallel,
  fused with Reciprocal Rank Fusion
- Optional type filter driven by classify_retrieval_intent() so questions
  about prior conversations don't pull documents and vice versa
- Cross-encoder rerank (ms-marco-MiniLM-L-6-v2) over RRF candidates before
  taking final top-N
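
The fusion step in the first stage can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the code in api.py: the function name, the k=60 constant (a commonly used default), and the sample document IDs are all assumptions made for illustration. Each document scores the sum of 1 / (k + rank) over the ranked lists that contain it.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists that contain
    it, with rank starting at 1. k=60 is a commonly used default and is an
    assumption here, not a value taken from the commit.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n] if top_n is not None else fused


# Hypothetical result lists. In the pipeline these would come from the
# BM25 (tsvector) query and the dense (pgvector) query respectively.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]

candidates = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Because RRF uses only ranks, the BM25 and vector-distance scores never need to be normalized against each other. Documents that appear in both lists rise to the top of the candidate set that is then handed to the reranker.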

Also adds scripts/reindex_docx_pptx.py — one-off re-ingest used to recover
table/header/text-box content in docx and pptx after the 93c0d89 extractor
upgrade — and scripts/test_retrieval.py to exercise the new pipeline against
representative queries.

Schema: requires GIN index on to_tsvector('english', document) (already
created out-of-band via psql since Apache AGE in shared_preload_libraries
blocks ALTER TABLE on this database).
2026-05-19 21:11:15 +00:00

51 lines
677 B
Plaintext

# Backup files (rely on git history instead)
*.bak
*.bak.*
# Runtime artifacts
watcher_heartbeat
dreamer_state.json
corpus_integrity_report.json
watcher_state.json
watcher_status.json
reindex_status.json
# Logs (these belong in /var/log/)
*.log
# Python artifacts
__pycache__/
*.pyc
*.pyo
*.pyd
.pytest_cache/
*.egg-info/
# Virtual environment
venv/
.venv/
# Environment and secrets
.env
.env.local
.env.*.local
# Editor and OS cruft
.vscode/
.idea/
*.swp
*.swo
.DS_Store
Thumbs.db
# Local data not for repo
db/
embeddings/
experiments/summary_embeddings_cache.json
# Aaron AI runtime data (personal, do not commit)
conversations.db
sessions.db
memory.md
settings.json