Compare commits
28 Commits
7b77794319
...
main
| Author | SHA1 | Date | |
|---|---|---|---|
| 5582549321 | |||
| 3ec9a48151 | |||
| 9d09d3fa14 | |||
| f185ed60cb | |||
| a4735053c2 | |||
| f682d8c6a0 | |||
| 151c756b89 | |||
| e96bf40b2f | |||
| 313c0f0341 | |||
| d2ec20e373 | |||
| 10bb29290a | |||
| 9bb083f065 | |||
| 430ea239dd | |||
| 0a1e2b4f61 | |||
| 8c2c597687 | |||
| fda61ad622 | |||
| 84994f9282 | |||
| 9e86297e2a | |||
| 9955c7e383 | |||
| 50b97e2998 | |||
| 8d560f9f5e | |||
| 732e450d21 | |||
| 63c58b5bb3 | |||
| 6c2af55e7e | |||
| 5b4a299414 | |||
| b09e35892c | |||
| e38d283e59 | |||
| 8e61e4dedb |
@@ -8,6 +8,7 @@ dreamer_state.json
|
|||||||
corpus_integrity_report.json
|
corpus_integrity_report.json
|
||||||
watcher_state.json
|
watcher_state.json
|
||||||
watcher_status.json
|
watcher_status.json
|
||||||
|
reindex_status.json
|
||||||
|
|
||||||
# Logs (these belong in /var/log/)
|
# Logs (these belong in /var/log/)
|
||||||
*.log
|
*.log
|
||||||
|
|||||||
@@ -0,0 +1,105 @@
|
|||||||
|
# OCR install record — 2026-05-04
|
||||||
|
|
||||||
|
## Machine
|
||||||
|
|
||||||
|
- Host: aaronai-01 (VPS)
|
||||||
|
- OS: Ubuntu 24.04 noble (kernel 6.8.0-110-generic, x86_64)
|
||||||
|
|
||||||
|
## apt packages installed
|
||||||
|
|
||||||
|
| package | version | source |
|---|---|---|
| tesseract-ocr | 5.3.4-1build5 | noble |
| tesseract-ocr-eng | 1:4.1.0-2 | noble |
| tesseract-ocr-osd | 1:4.1.0-2 | noble (automatic) |
| libtesseract5 | 5.3.4-1build5 | noble (automatic) |
|
||||||
|
|
||||||
|
## pip packages installed (into /home/aaron/aaronai/venv)
|
||||||
|
|
||||||
|
| package | version |
|---|---|
| pytesseract | 0.3.13 |
| ocrmypdf | 17.4.2 |
|
||||||
|
|
||||||
|
Dependencies pulled in by the two installs above, direct and transitive (also new in the venv): `pikepdf 10.5.1`, `pdfminer-six 20260107`, `pypdfium2 5.7.1`, `img2pdf 0.6.3`, `pi-heif 1.3.0`, `cryptography 47.0.0`, `cffi 2.0.0`, `pycparser 3.0`, `Deprecated 1.3.1`, `deprecation 2.1.0`, `defusedxml 0.7.1`, `fonttools 4.62.1`, `fpdf2 2.8.7`, `uharfbuzz 0.54.1`, `wrapt 2.1.2`, `pluggy 1.6.0`. `pillow` was already installed at 12.2.0.
|
||||||
|
|
||||||
|
## Smoke test 1 — `tesseract --version`
|
||||||
|
|
||||||
|
```
tesseract 5.3.4
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
 Found AVX512BW
 Found AVX512F
```
|
||||||
|
|
||||||
|
## Smoke test 2 — `tesseract --list-langs`
|
||||||
|
|
||||||
|
```
List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (2):
eng
osd
```
|
||||||
|
|
||||||
|
## Smoke test 3 — pytesseract on a slide image
|
||||||
|
|
||||||
|
- Input pptx: `/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF555 3D Computational/GH Slicer Notes.pptx`
- Extracted image: `ppt/media/image1.PNG` (1768×504 PNG)
- Wall-clock: 0.220s
- Chars extracted: 126
- First 200 chars:

```
Generates the Bounding Box for NESS

round(x, 4), round(y, 4), round(z, 4), round(a, 4))

Format ("HSS5 X(0} ¥(1} W(2} H(3)",
```
|
||||||
|
|
||||||
|
Note: the first image in `Renders.pptx` (image1.jpg, 640×480) returned 0 chars on the first attempt. A sample of 15 images from `Renders.pptx` showed that all 15 are rendered designs or photographs with no text, so the test switched to `GH Slicer Notes.pptx` (from the original candidate list of 4 image-only pptx files), where image1.PNG is a screenshot of code. Tesseract's behavior is correct in both cases; `Renders.pptx` is not a useful OCR test target because it contains no text. The code screenshot shows some character-recognition noise (e.g. `¥(1}` for `Y(1)`, plus misread parentheses and braces), which is acceptable for a baseline smoke test; production tuning is a worker-design concern.
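
The extract-then-OCR step of this smoke test can be sketched as follows. This is a minimal sketch, not the script that produced the numbers above: `list_pptx_images` and `ocr_pptx_image` are hypothetical helper names, and the code assumes the venv described earlier (`pytesseract`, `pillow`) and relies on a `.pptx` being a zip archive with its embedded images under `ppt/media/`.

```python
import io
import zipfile

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tif", ".tiff")


def list_pptx_images(pptx_path):
    """Return the archive paths of images embedded under ppt/media/."""
    with zipfile.ZipFile(pptx_path) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.startswith("ppt/media/") and name.lower().endswith(IMAGE_SUFFIXES)
        )


def ocr_pptx_image(pptx_path, member, lang="eng"):
    """OCR one embedded image and return the recognized text."""
    # Imported lazily so that listing images works without the OCR stack.
    import pytesseract
    from PIL import Image

    with zipfile.ZipFile(pptx_path) as archive:
        data = archive.read(member)
    with Image.open(io.BytesIO(data)) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

An empty string from `ocr_pptx_image` corresponds to the `Renders.pptx` case above: the image was read, but it contains no recognizable text.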
|
||||||
|
|
||||||
|
## Smoke test 4 — ocrmypdf on a Lexmark CX510de scan
|
||||||
|
|
||||||
|
- Input PDF: `/home/aaron/nextcloud/data/data/aaron/files/Admin/Dossier/Tenure/Dossier Scan 2022/image2022-01-07-133846 - CAryn.pdf` (4 pages, Producer: Lexmark CX510de, Creator: HardCopy)
- Command: `ocrmypdf --skip-text -l eng <input> /tmp/ocr_smoke/caryn_ocred.pdf`
- Wall-clock: 3.72s (whole PDF, 4 pages)
- Exit: 0
- After OCR, `pdftotext` on the output produced 2347 chars (2270 non-whitespace).
- First 200 chars of OCR'd text:

```
nN New Paltz
STATE UNIVERSITY OF NEW YORK

The Honors Program

May 30, 2017

Dear Aaron,

Thank you for serving as a reader for Caryn Byllott’s thesis on "Recall/Reconstruct: The Exploration of
Memory
```
|
||||||
|
|
||||||
|
Real readable English. The "nN" header is the Lexmark logo glyph; otherwise clean. ~0.93s/page on this scan, which is the reference number for sizing the async worker queue.
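
For a future worker, the same invocation could be wrapped and timed roughly like this. It is a sketch under assumptions: `build_ocrmypdf_cmd` and `run_ocrmypdf` are hypothetical names, the `ocrmypdf` executable is assumed to be on `PATH` (for example via the venv's `bin/`), and only the flags already used in the command above appear.

```python
import subprocess
import time


def build_ocrmypdf_cmd(src, dst, lang="eng"):
    """Build the same ocrmypdf invocation used in this smoke test."""
    # --skip-text leaves pages that already carry a text layer untouched.
    return ["ocrmypdf", "--skip-text", "-l", lang, str(src), str(dst)]


def run_ocrmypdf(src, dst, lang="eng"):
    """Run ocrmypdf and return (exit_code, wall_clock_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(build_ocrmypdf_cmd(src, dst, lang), check=False)
    return completed.returncode, time.perf_counter() - start
```

Returning the exit code instead of raising lets the caller decide how to treat non-zero exits, which a queue worker would probably want to log per file.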
|
||||||
|
|
||||||
|
## Reference timing
|
||||||
|
|
||||||
|
| operation | input size | wall-clock |
|---|---|---|
| pytesseract single image | 1768×504 PNG | 0.22s |
| ocrmypdf 4-page scan | 4 pages, ~A4 | 3.72s (~0.93s/page) |
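
As a rough illustration of how the per-page figure could feed queue sizing, the sketch below scales it linearly. The assumptions are loud ones: page counts for the backlog PDFs are not recorded here, throughput is assumed to stay at ~0.93s/page, and parallel workers are assumed not to contend for CPU. `estimate_backlog_seconds` is a hypothetical helper.

```python
import math


def estimate_backlog_seconds(num_pdfs, pages_per_pdf, seconds_per_page=0.93, workers=1):
    """Rough wall-clock estimate for OCRing a backlog of scanned PDFs."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    total_pages = num_pdfs * pages_per_pdf
    # Pages are split as evenly as possible; the busiest worker sets the finish time.
    pages_on_busiest_worker = math.ceil(total_pages / workers)
    return pages_on_busiest_worker * seconds_per_page
```

If the 75 scanned PDFs averaged 4 pages each, like the smoke-test file, a single worker would need about 75 × 4 × 0.93 ≈ 279 seconds. The real average page count is unknown and would change this proportionally.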
|
||||||
|
|
||||||
|
## Deferred — project dep-tracking
|
||||||
|
|
||||||
|
The project has no dependency manifest on disk: no `requirements.txt`, `pyproject.toml`, `setup.py`, `Pipfile`, or `poetry.lock`. Pip deps live only in `venv/`. The OCR install adds `pytesseract` and `ocrmypdf` (plus their transitive closure listed above) to that untracked venv state.
|
||||||
|
|
||||||
|
This commit does not introduce a manifest. Tracking the dep-manifest decision as its own followup; the natural deadline is the capture-path integration commit, where `import pytesseract` will become load-bearing in the repo. If the manifest question is unresolved by then, that integration commit is the right place to address it.
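
Whatever manifest format is eventually chosen, the current venv state can be captured as a starting point. The sketch below is one possible approach, not a decision: `pinned_requirements` is a hypothetical helper that lists installed distributions through the standard library's `importlib.metadata`, and it must be run with the venv's own interpreter to see the venv's packages.

```python
from importlib import metadata


def pinned_requirements():
    """Return sorted 'name==version' lines for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that lack a Name field
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    print("\n".join(pinned_requirements()))
```

The output pins every package in the environment, including the transitive ones listed earlier, so it records state but does not distinguish the packages the project actually imports.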
|
||||||
|
|
||||||
|
## Followups
|
||||||
|
|
||||||
|
- Async OCR worker (separate session). Use the reference timing above to size the queue.
|
||||||
|
- Capture path integration: phone-camera images → `pytesseract.image_to_string` → existing chunk/embed pipeline.
|
||||||
|
- Backlog processing of 75 scanned PDFs (Lexmark CX510de and similar) and the 4 image-only pptx (`Renders.pptx`, `Ribbon Cutting Slideshow.pptx`, two `GH Slicer Notes` variants). Per the smoke results, `Renders.pptx` is unlikely to yield useful OCR text — it is rendered-design content, not scanned documents — and may instead need exclusion rather than processing.
|
||||||
|
- Project dep-manifest decision (see Deferred section above).
|
||||||
@@ -0,0 +1,4 @@
|
|||||||
|
# Local backups created by apply.sh — environment state, not source.
|
||||||
|
# Keeping these out of version control prevents repo bloat and avoids
|
||||||
|
# checking in graphiti-core's Apache-2.0 source under our repo's tree.
|
||||||
|
backups/
|
||||||
@@ -0,0 +1,58 @@
|
|||||||
|
# graphiti-core Patches — FalkorDB Vector Index Support
|
||||||
|
|
||||||
|
Vendored patches against graphiti-core 0.29.0 adding native FalkorDB
vector index support. Three files are modified: `graphiti_core/graph_queries.py`
plus the FalkorDB driver and its search operations under
`graphiti_core/driver/`. No changes to Neo4j or Kuzu code paths.
|
||||||
|
|
||||||
|
## Why this exists
|
||||||
|
|
||||||
|
graphiti-core's FalkorDB driver uses interpreted Cypher cosine math
|
||||||
|
(`vec.cosineDistance(...)`) for similarity search. Each query becomes a
|
||||||
|
full table scan over Entity/RELATES_TO/Community nodes. At ~4,000+
|
||||||
|
entities, single-episode ingest's resolve-against-existing-graph step
|
||||||
|
takes 8+ minutes and bulk ingest hangs FalkorDB. FalkorDB itself
|
||||||
|
supports `db.idx.vector.queryNodes` and `db.idx.vector.queryRelationships`
|
||||||
|
procedures backed by HNSW indexes; graphiti-core's driver doesn't use
|
||||||
|
them.
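
The difference between the two query shapes can be sketched as below. These builders are hypothetical and simplified: they omit the filters, score normalization, and return columns that graphiti-core uses, and the multiplier value passed to `candidate_k` is an assumption, not the patch's setting. Only `vec.cosineDistance`, `vecf32`, and `db.idx.vector.queryNodes` come from the text above.

```python
def full_scan_node_query(label, attr):
    """Interpreted-Cypher shape: every node with the label is scored."""
    return (
        f"MATCH (n:{label}) "
        f"WITH n, vec.cosineDistance(n.{attr}, vecf32($search_vector)) AS distance "
        "RETURN n.uuid AS uuid, distance ORDER BY distance ASC LIMIT $limit"
    )


def indexed_node_query(label, attr):
    """Index-backed shape: the HNSW index proposes $candidate_k candidates."""
    return (
        f"CALL db.idx.vector.queryNodes('{label}', '{attr}', $candidate_k, "
        "vecf32($search_vector)) YIELD node, score "
        "RETURN node.uuid AS uuid, score LIMIT $limit"
    )


def candidate_k(limit, multiplier):
    """Over-fetch so that later filtering can still fill `limit` results."""
    return limit * multiplier
```

Before reusing a similarity threshold on the indexed path, it is worth checking in the FalkorDB documentation whether the procedure's `score` is a similarity or a distance, since the two sort in opposite directions.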
|
||||||
|
|
||||||
|
These patches:
|
||||||
|
|
||||||
|
1. Add `get_vector_indices()` to `graph_queries.py` returning CREATE
|
||||||
|
VECTOR INDEX statements for FalkorDB on Entity.name_embedding,
|
||||||
|
RELATES_TO.fact_embedding, and Community.name_embedding.
|
||||||
|
2. Extend `falkordb_driver.py:build_indices_and_constraints()` to create
|
||||||
|
the vector indexes alongside range and fulltext indexes.
|
||||||
|
3. Rewrite the three vector-similarity call sites in
|
||||||
|
`falkordb/operations/search_ops.py` to use
|
||||||
|
`db.idx.vector.queryNodes` and `db.idx.vector.queryRelationships`
|
||||||
|
instead of full-scan cosine math. Over-fetches by a configurable
|
||||||
|
multiplier to handle filter rejections.
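
The statements from step 1 might look like the output of the sketch below. This is an illustration, not the patched `get_vector_indices()`: the function name, its parameters, and the embedding dimension are assumptions, and the `CREATE VECTOR INDEX ... OPTIONS {...}` form should be checked against the FalkorDB documentation for the installed version.

```python
def sketch_vector_index_statements(dimension, similarity="cosine"):
    """Build CREATE VECTOR INDEX statements for the three embedding targets."""
    targets = [
        ("node", "Entity", "name_embedding"),
        ("edge", "RELATES_TO", "fact_embedding"),
        ("node", "Community", "name_embedding"),
    ]
    statements = []
    for kind, label, attr in targets:
        # Nodes are matched by label, relationships by type.
        pattern = f"(x:{label})" if kind == "node" else f"()-[x:{label}]->()"
        statements.append(
            f"CREATE VECTOR INDEX FOR {pattern} ON (x.{attr}) "
            f"OPTIONS {{dimension: {dimension}, similarityFunction: '{similarity}'}}"
        )
    return statements
```

The `dimension` must match the embedding model in use; it is passed in here because this README does not state it.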
|
||||||
|
|
||||||
|
## Files
|
||||||
|
|
||||||
|
| Patched file | Change |
|---|---|
| `graphiti_core/graph_queries.py` | Adds `get_vector_indices()` |
| `graphiti_core/driver/falkordb_driver.py` | Extends `build_indices_and_constraints` |
| `graphiti_core/driver/falkordb/operations/search_ops.py` | Three query rewrites |
|
||||||
|
|
||||||
|
## How to apply
|
||||||
|
|
||||||
|
`./apply.sh` — backs up the originals into `./backups/<timestamp>/`
|
||||||
|
and copies the patched files over.
|
||||||
|
|
||||||
|
## How to revert
|
||||||
|
|
||||||
|
Copy the timestamped backup back over the venv. The backup tree mirrors
the venv layout, so each file sits under `backups/<ts>/graphiti_core/`:

    cp backups/<ts>/graphiti_core/graph_queries.py /home/aaron/aaronai/venv/lib/python3.12/site-packages/graphiti_core/graph_queries.py
    # ...etc
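
Because `apply.sh` writes backups in a layout that mirrors the venv, a revert can also be done as a tree copy. The sketch below is a hypothetical helper, not part of this patch set: `restore_backup` copies every file under a backup directory to the same relative path under a target `site-packages`.

```python
import shutil
from pathlib import Path


def restore_backup(backup_root, site_packages):
    """Copy every file under backup_root back to the same relative path."""
    backup_root = Path(backup_root)
    site_packages = Path(site_packages)
    restored = []
    for src in sorted(backup_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(backup_root)
        dst = site_packages / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        restored.append(rel.as_posix())
    return restored
```

A call would pass the timestamped backup directory and the venv's `site-packages` path; the service restart that `apply.sh` recommends applies after a revert as well.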
|
||||||
|
|
||||||
|
## Upstream candidate
|
||||||
|
|
||||||
|
This is a documented gap upstream: issue #1263 references it indirectly
via the vector store overlay RFC. The maintainers' attention is on a
Milvus/external vector DB overlay; this patch is the FalkorDB-native
alternative for users who don't want a separate vector DB. Consider
opening a PR after empirical validation in production.
|
||||||
Executable
+77
@@ -0,0 +1,77 @@
|
|||||||
|
#!/usr/bin/env bash
# apply.sh: Apply the BirdAI vendored graphiti-core patches.
#
# Backs up the original venv files into ./backups/<timestamp>/ before
# overwriting. The backup directory layout mirrors the venv layout so a
# revert is just a tree copy back.
#
# Usage: ./apply.sh

set -euo pipefail

PATCH_DIR="$(cd "$(dirname "$0")" && pwd)"
VENV_BASE="/home/aaron/aaronai/venv/lib/python3.12/site-packages"
TIMESTAMP="$(date +%Y%m%d-%H%M%S)"
BACKUP_DIR="$PATCH_DIR/backups/$TIMESTAMP"

# Files to patch; paths are relative to graphiti_core/.
FILES=(
    "graph_queries.py"
    "driver/falkordb_driver.py"
    "driver/falkordb/operations/search_ops.py"
)

echo "graphiti-core vendored patch apply - BirdAI"
echo "Patch directory: $PATCH_DIR"
echo "Venv target: $VENV_BASE/graphiti_core/"
echo "Backup to: $BACKUP_DIR"
echo

# Pre-flight: confirm all source patch files exist.
for rel in "${FILES[@]}"; do
    if [ ! -f "$PATCH_DIR/graphiti_core/$rel" ]; then
        echo "ERROR: missing patch file: $PATCH_DIR/graphiti_core/$rel" >&2
        exit 1
    fi
done

# Pre-flight: confirm all target venv files exist.
for rel in "${FILES[@]}"; do
    if [ ! -f "$VENV_BASE/graphiti_core/$rel" ]; then
        echo "ERROR: missing venv file: $VENV_BASE/graphiti_core/$rel" >&2
        echo " graphiti-core may not be installed, or version differs from 0.29.0." >&2
        exit 1
    fi
done

# Backup originals.
echo "[1/3] Backing up originals..."
for rel in "${FILES[@]}"; do
    backup_path="$BACKUP_DIR/graphiti_core/$rel"
    mkdir -p "$(dirname "$backup_path")"
    cp "$VENV_BASE/graphiti_core/$rel" "$backup_path"
    echo " backed up: $rel"
done
echo

# Apply patches by copying.
echo "[2/3] Applying patches..."
for rel in "${FILES[@]}"; do
    cp "$PATCH_DIR/graphiti_core/$rel" "$VENV_BASE/graphiti_core/$rel"
    echo " patched: $rel"
done
echo

# Sanity check: confirm patched files have the marker.
echo "[3/3] Verifying patched files..."
for rel in "${FILES[@]}"; do
    if grep -q "PATCHED 2026-05-02" "$VENV_BASE/graphiti_core/$rel"; then
        echo " OK: $rel contains patch marker"
    else
        echo " WARNING: $rel missing patch marker (may be expected for graph_queries.py, whose docstring uses the marker only in the module header)"
    fi
done
echo
echo "Done. Backup: $BACKUP_DIR"
echo "Restart the sidecar to pick up changes:"
echo " sudo systemctl restart aaronai-graphiti.service"
|
||||||
@@ -0,0 +1,904 @@
|
|||||||
|
"""
|
||||||
|
Copyright 2024, Zep Software, Inc.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
you may not use this file except in compliance with the License.
|
||||||
|
You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software
|
||||||
|
distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
See the License for the specific language governing permissions and
|
||||||
|
limitations under the License.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
from typing import Any
|
||||||
|
|
||||||
|
from graphiti_core.driver.driver import GraphProvider
|
||||||
|
from graphiti_core.driver.falkordb import STOPWORDS
|
||||||
|
from graphiti_core.driver.operations.search_ops import SearchOperations
|
||||||
|
from graphiti_core.driver.query_executor import QueryExecutor
|
||||||
|
from graphiti_core.driver.record_parsers import (
|
||||||
|
community_node_from_record,
|
||||||
|
entity_edge_from_record,
|
||||||
|
entity_node_from_record,
|
||||||
|
episodic_node_from_record,
|
||||||
|
)
|
||||||
|
from graphiti_core.edges import EntityEdge
|
||||||
|
from graphiti_core.graph_queries import (
|
||||||
|
get_nodes_query,
|
||||||
|
get_relationships_query,
|
||||||
|
get_vector_cosine_func_query,
|
||||||
|
)
|
||||||
|
from graphiti_core.models.edges.edge_db_queries import get_entity_edge_return_query
|
||||||
|
from graphiti_core.models.nodes.node_db_queries import (
|
||||||
|
COMMUNITY_NODE_RETURN,
|
||||||
|
EPISODIC_NODE_RETURN,
|
||||||
|
get_entity_node_return_query,
|
||||||
|
)
|
||||||
|
from graphiti_core.nodes import CommunityNode, EntityNode, EpisodicNode
|
||||||
|
from graphiti_core.search.search_filters import (
|
||||||
|
SearchFilters,
|
||||||
|
edge_search_filter_query_constructor,
|
||||||
|
node_search_filter_query_constructor,
|
||||||
|
)
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
MAX_QUERY_LENGTH = 128
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
# Vector index dispatcher (PATCHED 2026-05-02, BirdAI vendored patch).
#
# graphiti-core's FalkorDB driver historically composed similarity queries
# using `vec.cosineDistance(...)` in interpreted Cypher, which produces a
# full-table scan for every search. FalkorDB supports native vector indexes
# via `db.idx.vector.queryNodes` and `db.idx.vector.queryRelationships`;
# this dispatcher uses them when present and falls back to the cosine math
# otherwise.
#
# Index existence is checked once per (label, attribute, entity_type) and
# cached at module scope. The cache should be invalidated whenever
# `build_indices_and_constraints` runs (since indexes may have been created
# or dropped). FalkorDriver.build_indices_and_constraints is patched to
# call `_invalidate_falkordb_vector_index_cache()` after building.
#
# Over-fetch factor (VECTOR_INDEX_CANDIDATE_MULTIPLIER from graph_queries)
# preserves recall when WHERE filters reject some of the top-k candidates.
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
# get_vector_cosine_func_query is already imported at the top of the module.
from graphiti_core.graph_queries import VECTOR_INDEX_CANDIDATE_MULTIPLIER
|
||||||
|
|
||||||
|
# Cache: key = (label, attribute, entity_type), value = bool
# entity_type is 'NODE' or 'RELATIONSHIP'.
_FALKORDB_VECTOR_INDEX_CACHE: dict[tuple[str, str, str], bool] = {}


def _invalidate_falkordb_vector_index_cache() -> None:
    """Clear the vector-index existence cache. Call after build_indices_and_constraints."""
    _FALKORDB_VECTOR_INDEX_CACHE.clear()
|
||||||
|
|
||||||
|
|
||||||
|
async def _falkordb_vector_index_exists(
    executor: QueryExecutor,
    label: str,
    attribute: str,
    entity_type: str,
) -> bool:
    """Check whether a FalkorDB vector index exists for the given target.

    entity_type is 'NODE' for node-label indexes, 'RELATIONSHIP' for edge-type indexes.
    Result is cached at module scope; call _invalidate_falkordb_vector_index_cache()
    after building or dropping indexes.
    """
    key = (label, attribute, entity_type)
    if key in _FALKORDB_VECTOR_INDEX_CACHE:
        return _FALKORDB_VECTOR_INDEX_CACHE[key]

    try:
        records, _, _ = await executor.execute_query(
            "CALL db.indexes() YIELD label, properties, types, entitytype "
            "RETURN label, properties, types, entitytype"
        )
    except Exception as e:
        # If we cannot enumerate indexes, fall back to "no index" rather than
        # propagating the error. The fallback cosine-math path is correct,
        # just slower.
        logger.warning(f"FalkorDB vector index probe failed; assuming none exist: {e}")
        _FALKORDB_VECTOR_INDEX_CACHE[key] = False
        return False

    found = False
    for r in records:
        # Records come back as dict-like rows keyed by column name (not
        # tuples). Access by string keys matching the YIELD clause above.
        rec_label = r.get('label') if hasattr(r, 'get') else r['label']
        rec_props = r.get('properties') if hasattr(r, 'get') else r['properties']
        rec_types = r.get('types') if hasattr(r, 'get') else r['types']
        rec_entitytype = r.get('entitytype') if hasattr(r, 'get') else r['entitytype']
        if rec_props is None:
            rec_props = []
        if rec_types is None:
            rec_types = {}

        if rec_label != label:
            continue
        if rec_entitytype is not None and rec_entitytype != entity_type:
            continue
        if attribute not in rec_props:
            continue

        # rec_types is a dict like {attribute: ['VECTOR', ...], ...} or sometimes
        # a flat list; handle both shapes.
        if isinstance(rec_types, dict):
            attr_types = rec_types.get(attribute, [])
        else:
            attr_types = rec_types
        if 'VECTOR' in attr_types:
            found = True
            break

    _FALKORDB_VECTOR_INDEX_CACHE[key] = found
    return found
|
||||||
|
|
||||||
|
|
||||||
|
def _falkordb_vector_node_search_cypher(
    label: str,
    embedding_attr: str,
    search_vector_param: str,
    use_index: bool,
) -> tuple[str, str]:
    """Build the cypher prefix and node-binding for a node-vector search.

    Returns (prefix, node_var) where:
    - prefix is the Cypher fragment that binds the node variable and a
      `score` variable. With index, it's a CALL ... YIELD; without, it's
      a MATCH plus WITH cosine math.
    - node_var is the variable name the caller's downstream Cypher should
      reference (always 'n' here for parity with the existing code).

    The caller appends WHERE filters and RETURN/ORDER BY/LIMIT as usual.
    The over-fetch parameter `$candidate_k` must be passed by the caller
    when use_index is True.
    """
    if use_index:
        return (
            f"CALL db.idx.vector.queryNodes("
            f"'{label}', '{embedding_attr}', $candidate_k, vecf32({search_vector_param})"
            f") YIELD node, score "
            f"WITH node AS n, score "
        ), "n"
    # Fallback: original cosine math path
    cosine = get_vector_cosine_func_query(
        f"n.{embedding_attr}", search_vector_param, GraphProvider.FALKORDB
    )
    return (
        f"MATCH (n:{label}) "
        f"WITH n, {cosine} AS score "
    ), "n"
|
||||||
|
|
||||||
|
|
||||||
|
def _falkordb_vector_edge_search_cypher(
    relationship_type: str,
    embedding_attr: str,
    search_vector_param: str,
    use_index: bool,
) -> tuple[str, str]:
    """Build the cypher prefix and edge-binding for an edge-vector search.

    Returns (prefix, edge_var). With the index, the procedure binds the
    relationship variable; we then MATCH source and target via the existing
    edge to recover (n)-[e]->(m). Without the index, it's the original
    MATCH-and-cosine path.

    Variable name is 'e' for parity with existing code; source/target are
    'n' and 'm' respectively, also for parity.
    """
    if use_index:
        return (
            f"CALL db.idx.vector.queryRelationships("
            f"'{relationship_type}', '{embedding_attr}', $candidate_k, vecf32({search_vector_param})"
            f") YIELD relationship, score "
            f"MATCH (n:Entity)-[e:{relationship_type}]->(m:Entity) "
            f"WHERE e = relationship "
            f"WITH DISTINCT e, n, m, score "
        ), "e"
    # Fallback
    cosine = get_vector_cosine_func_query(
        f"e.{embedding_attr}", search_vector_param, GraphProvider.FALKORDB
    )
    return (
        f"MATCH (n:Entity)-[e:{relationship_type}]->(m:Entity) "
        f"WITH DISTINCT e, n, m, {cosine} AS score "
    ), "e"
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
# FalkorDB separator characters that break text into tokens
|
||||||
|
_SEPARATOR_MAP = str.maketrans(
|
||||||
|
{
|
||||||
|
',': ' ',
|
||||||
|
'.': ' ',
|
||||||
|
'<': ' ',
|
||||||
|
'>': ' ',
|
||||||
|
'{': ' ',
|
||||||
|
'}': ' ',
|
||||||
|
'[': ' ',
|
||||||
|
']': ' ',
|
||||||
|
'"': ' ',
|
||||||
|
"'": ' ',
|
||||||
|
':': ' ',
|
||||||
|
';': ' ',
|
||||||
|
'!': ' ',
|
||||||
|
'@': ' ',
|
||||||
|
'#': ' ',
|
||||||
|
'$': ' ',
|
||||||
|
'%': ' ',
|
||||||
|
'^': ' ',
|
||||||
|
'&': ' ',
|
||||||
|
'*': ' ',
|
||||||
|
'(': ' ',
|
||||||
|
')': ' ',
|
||||||
|
'-': ' ',
|
||||||
|
'+': ' ',
|
||||||
|
'=': ' ',
|
||||||
|
'~': ' ',
|
||||||
|
'?': ' ',
|
||||||
|
'|': ' ',
|
||||||
|
'/': ' ',
|
||||||
|
'\\': ' ',
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _sanitize(query: str) -> str:
|
||||||
|
"""Replace FalkorDB special characters with whitespace."""
|
||||||
|
sanitized = query.translate(_SEPARATOR_MAP)
|
||||||
|
return ' '.join(sanitized.split())
|
||||||
|
|
||||||
|
|
||||||
|
def _build_falkor_fulltext_query(
|
||||||
|
query: str,
|
||||||
|
group_ids: list[str] | None = None,
|
||||||
|
max_query_length: int = MAX_QUERY_LENGTH,
|
||||||
|
) -> str:
|
||||||
|
"""Build a fulltext query string for FalkorDB using RedisSearch syntax."""
|
||||||
|
if group_ids is None or len(group_ids) == 0:
|
||||||
|
group_filter = ''
|
||||||
|
else:
|
||||||
|
escaped_group_ids = [f'"{gid}"' for gid in group_ids]
|
||||||
|
group_values = '|'.join(escaped_group_ids)
|
||||||
|
group_filter = f'(@group_id:{group_values})'
|
||||||
|
|
||||||
|
sanitized_query = _sanitize(query)
|
||||||
|
|
||||||
|
# Remove stopwords and empty tokens
|
||||||
|
query_words = sanitized_query.split()
|
||||||
|
filtered_words = [word for word in query_words if word and word.lower() not in STOPWORDS]
|
||||||
|
sanitized_query = ' | '.join(filtered_words)
|
||||||
|
|
||||||
|
if len(sanitized_query.split(' ')) + len(group_ids or '') >= max_query_length:
|
||||||
|
return ''
|
||||||
|
|
||||||
|
full_query = group_filter + ' (' + sanitized_query + ')'
|
||||||
|
return full_query
|
||||||
|
|
||||||
|
|
||||||
|
class FalkorSearchOperations(SearchOperations):
|
||||||
|
# --- Node search ---
|
||||||
|
|
||||||
|
async def node_fulltext_search(
|
||||||
|
self,
|
||||||
|
executor: QueryExecutor,
|
||||||
|
query: str,
|
||||||
|
search_filter: SearchFilters,
|
||||||
|
group_ids: list[str] | None = None,
|
||||||
|
limit: int = 10,
|
||||||
|
) -> list[EntityNode]:
|
||||||
|
fuzzy_query = _build_falkor_fulltext_query(query, group_ids)
|
||||||
|
if fuzzy_query == '':
|
||||||
|
return []
|
||||||
|
|
||||||
|
filter_queries, filter_params = node_search_filter_query_constructor(
|
||||||
|
search_filter, GraphProvider.FALKORDB
|
||||||
|
)
|
||||||
|
|
||||||
|
if group_ids is not None:
|
||||||
|
filter_queries.append('n.group_id IN $group_ids')
|
||||||
|
filter_params['group_ids'] = group_ids
|
||||||
|
|
||||||
|
filter_query = ''
|
||||||
|
if filter_queries:
|
||||||
|
filter_query = ' WHERE ' + (' AND '.join(filter_queries))
|
||||||
|
|
||||||
|
cypher = (
|
||||||
|
get_nodes_query(
|
||||||
|
'node_name_and_summary', '$query', limit=limit, provider=GraphProvider.FALKORDB
|
||||||
|
)
|
||||||
|
+ 'YIELD node AS n, score'
|
||||||
|
+ filter_query
|
||||||
|
+ """
|
||||||
|
WITH n, score
|
||||||
|
ORDER BY score DESC
|
||||||
|
LIMIT $limit
|
||||||
|
RETURN
|
||||||
|
"""
|
||||||
|
+ get_entity_node_return_query(GraphProvider.FALKORDB)
|
||||||
|
)
|
||||||
|
|
||||||
|
records, _, _ = await executor.execute_query(
|
||||||
|
cypher,
|
||||||
|
query=fuzzy_query,
|
||||||
|
limit=limit,
|
||||||
|
**filter_params,
|
||||||
|
)
|
||||||
|
|
||||||
|
return [entity_node_from_record(r) for r in records]
|
||||||
|
|
||||||
|
    async def node_similarity_search(
        self,
        executor: QueryExecutor,
        search_vector: list[float],
        search_filter: SearchFilters,
        group_ids: list[str] | None = None,
        limit: int = 10,
        min_score: float = 0.6,
    ) -> list[EntityNode]:
        filter_queries, filter_params = node_search_filter_query_constructor(
            search_filter, GraphProvider.FALKORDB
        )

        if group_ids is not None:
            filter_queries.append('n.group_id IN $group_ids')
            filter_params['group_ids'] = group_ids

        filter_query = ''
        if filter_queries:
            filter_query = ' WHERE ' + (' AND '.join(filter_queries))

        # PATCHED 2026-05-02 (BirdAI vendored patch): use FalkorDB native vector
        # index when available; fall back to interpreted-Cypher cosine math
        # otherwise. Both prefixes end in a WITH clause that binds `n` and
        # `score`, so the same WHERE clause (the filters plus the min_score
        # threshold) is appended after either one.
        use_index = await _falkordb_vector_index_exists(
            executor, 'Entity', 'name_embedding', 'NODE'
        )
        prefix, _ = _falkordb_vector_node_search_cypher(
            'Entity', 'name_embedding', '$search_vector', use_index
        )
        where_clauses = []
        if filter_query:
            where_clauses.append(filter_query.replace(' WHERE ', '', 1).strip())
        where_clauses.append('score > $min_score')
        unified_where = ' WHERE ' + ' AND '.join(where_clauses)

        cypher = (
            prefix
            + unified_where
            + """
            RETURN
            """
            + get_entity_node_return_query(GraphProvider.FALKORDB)
            + """
            ORDER BY score DESC
            LIMIT $limit
            """
        )
        params = dict(
            search_vector=search_vector,
            limit=limit,
            min_score=min_score,
            **filter_params,
        )
        if use_index:
            # Over-fetch so that filtering can still return up to `limit` rows.
            params['candidate_k'] = limit * VECTOR_INDEX_CANDIDATE_MULTIPLIER
        records, _, _ = await executor.execute_query(cypher, **params)

        return [entity_node_from_record(r) for r in records]
|
||||||
|
|
||||||
|
async def node_bfs_search(
|
||||||
|
self,
|
||||||
|
executor: QueryExecutor,
|
||||||
|
origin_uuids: list[str],
|
||||||
|
search_filter: SearchFilters,
|
||||||
|
max_depth: int,
|
||||||
|
group_ids: list[str] | None = None,
|
||||||
|
limit: int = 10,
|
||||||
|
) -> list[EntityNode]:
|
||||||
|
if not origin_uuids or max_depth < 1:
|
||||||
|
return []
|
||||||
|
|
||||||
|
filter_queries, filter_params = node_search_filter_query_constructor(
|
||||||
|
search_filter, GraphProvider.FALKORDB
|
||||||
|
)
|
||||||
|
|
||||||
|
if group_ids is not None:
|
||||||
|
filter_queries.append('n.group_id IN $group_ids')
|
||||||
|
filter_queries.append('origin.group_id IN $group_ids')
|
||||||
|
filter_params['group_ids'] = group_ids
|
||||||
|
|
||||||
|
filter_query = ''
|
||||||
|
if filter_queries:
|
||||||
|
filter_query = ' AND ' + (' AND '.join(filter_queries))
|
||||||
|
|
||||||
|
cypher = (
|
||||||
|
f"""
|
||||||
|
UNWIND $bfs_origin_node_uuids AS origin_uuid
|
||||||
|
MATCH (origin {{uuid: origin_uuid}})-[:RELATES_TO|MENTIONS*1..{max_depth}]->(n:Entity)
|
||||||
|
WHERE n.group_id = origin.group_id
|
||||||
|
"""
|
||||||
|
+ filter_query
|
||||||
|
+ """
|
||||||
|
RETURN
|
||||||
|
"""
|
||||||
|
+ get_entity_node_return_query(GraphProvider.FALKORDB)
|
||||||
|
+ """
|
||||||
|
LIMIT $limit
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
|
||||||
|
records, _, _ = await executor.execute_query(
|
||||||
|
cypher,
|
||||||
|
bfs_origin_node_uuids=origin_uuids,
|
||||||
|
limit=limit,
|
||||||
|
**filter_params,
|
||||||
|
)
|
||||||
|
|
||||||
|
return [entity_node_from_record(r) for r in records]
|
||||||
|
|
||||||
|
# --- Edge search ---
|
||||||
|
|
||||||
|
async def edge_fulltext_search(
|
||||||
|
self,
|
||||||
|
executor: QueryExecutor,
|
||||||
|
query: str,
|
||||||
|
search_filter: SearchFilters,
|
||||||
|
group_ids: list[str] | None = None,
|
||||||
|
limit: int = 10,
|
||||||
|
) -> list[EntityEdge]:
|
||||||
|
fuzzy_query = _build_falkor_fulltext_query(query, group_ids)
|
||||||
|
if fuzzy_query == '':
|
||||||
|
return []
|
||||||
|
|
||||||
|
filter_queries, filter_params = edge_search_filter_query_constructor(
|
||||||
|
search_filter, GraphProvider.FALKORDB
|
||||||
|
)
|
||||||
|
|
||||||
|
if group_ids is not None:
|
||||||
|
filter_queries.append('e.group_id IN $group_ids')
|
||||||
|
filter_params['group_ids'] = group_ids
|
||||||
|
|
||||||
|
filter_query = ''
|
||||||
|
if filter_queries:
|
||||||
|
filter_query = ' WHERE ' + (' AND '.join(filter_queries))
|
||||||
|
|
||||||
|
cypher = (
|
||||||
|
get_relationships_query(
|
||||||
|
'edge_name_and_fact', limit=limit, provider=GraphProvider.FALKORDB
|
||||||
|
)
|
||||||
|
+ """
|
||||||
|
YIELD relationship AS rel, score
|
||||||
|
MATCH (n:Entity)-[e:RELATES_TO {uuid: rel.uuid}]->(m:Entity)
|
||||||
|
"""
|
||||||
|
+ filter_query
|
||||||
|
+ """
|
||||||
|
WITH e, score, n, m
|
||||||
|
RETURN
|
||||||
|
"""
|
||||||
|
+ get_entity_edge_return_query(GraphProvider.FALKORDB)
|
||||||
|
+ """
|
||||||
|
ORDER BY score DESC
|
||||||
|
LIMIT $limit
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
|
||||||
|
records, _, _ = await executor.execute_query(
|
||||||
|
cypher,
|
||||||
|
query=fuzzy_query,
|
||||||
|
limit=limit,
|
||||||
|
**filter_params,
|
||||||
|
)
|
||||||
|
|
||||||
|
return [entity_edge_from_record(r) for r in records]
|
||||||
|
|
||||||
|
async def edge_similarity_search(
|
||||||
|
self,
|
||||||
|
executor: QueryExecutor,
|
||||||
|
search_vector: list[float],
|
||||||
|
source_node_uuid: str | None,
|
||||||
|
target_node_uuid: str | None,
|
||||||
|
search_filter: SearchFilters,
|
||||||
|
group_ids: list[str] | None = None,
|
||||||
|
limit: int = 10,
|
||||||
|
min_score: float = 0.6,
|
||||||
|
) -> list[EntityEdge]:
|
||||||
|
filter_queries, filter_params = edge_search_filter_query_constructor(
|
||||||
|
search_filter, GraphProvider.FALKORDB
|
||||||
|
)
|
||||||
|
|
||||||
|
if group_ids is not None:
|
||||||
|
filter_queries.append('e.group_id IN $group_ids')
|
||||||
|
filter_params['group_ids'] = group_ids
|
||||||
|
|
||||||
|
if source_node_uuid is not None:
|
||||||
|
filter_params['source_uuid'] = source_node_uuid
|
||||||
|
filter_queries.append('n.uuid = $source_uuid')
|
||||||
|
|
||||||
|
if target_node_uuid is not None:
|
||||||
|
filter_params['target_uuid'] = target_node_uuid
|
||||||
|
filter_queries.append('m.uuid = $target_uuid')
|
||||||
|
|
||||||
|
filter_query = ''
|
||||||
|
if filter_queries:
|
||||||
|
filter_query = ' WHERE ' + (' AND '.join(filter_queries))
|
||||||
|
|
||||||
|
# PATCHED 2026-05-02 (BirdAI vendored patch): use FalkorDB native vector
|
||||||
|
# index on RELATES_TO.fact_embedding when available. The unindexed
|
||||||
|
# fallback is the same MATCH-and-cosine math that previously hung
|
||||||
|
# for 6+ minutes on a 4,000-entity graph; this is the load-bearing
|
||||||
|
# call site that motivated the patch.
|
||||||
|
use_index = await _falkordb_vector_index_exists(
|
||||||
|
executor, 'RELATES_TO', 'fact_embedding', 'RELATIONSHIP'
|
||||||
|
)
|
||||||
|
prefix, _ = _falkordb_vector_edge_search_cypher(
|
||||||
|
'RELATES_TO', 'fact_embedding', '$search_vector', use_index
|
||||||
|
)
|
||||||
|
where_clauses = []
|
||||||
|
if filter_query:
|
||||||
|
where_clauses.append(filter_query.replace(' WHERE ', '', 1).strip())
|
||||||
|
where_clauses.append('score > $min_score')
|
||||||
|
unified_where = ' WHERE ' + ' AND '.join(where_clauses)
|
||||||
|
|
||||||
|
cypher = (
|
||||||
|
prefix
|
||||||
|
+ unified_where
|
||||||
|
+ """
|
||||||
|
RETURN
|
||||||
|
"""
|
||||||
|
+ get_entity_edge_return_query(GraphProvider.FALKORDB)
|
||||||
|
+ """
|
||||||
|
ORDER BY score DESC
|
||||||
|
LIMIT $limit
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
params = dict(
|
||||||
|
search_vector=search_vector,
|
||||||
|
limit=limit,
|
||||||
|
min_score=min_score,
|
||||||
|
**filter_params,
|
||||||
|
)
|
||||||
|
if use_index:
|
||||||
|
params['candidate_k'] = limit * VECTOR_INDEX_CANDIDATE_MULTIPLIER
|
||||||
|
records, _, _ = await executor.execute_query(cypher, **params)
|
||||||
|
|
||||||
|
return [entity_edge_from_record(r) for r in records]
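
The WHERE-merging step in the patched `edge_similarity_search` can be looked at in isolation. The following is a minimal sketch, assuming a hypothetical helper name `merge_where` that is not part of graphiti_core; it mirrors the logic above of stripping the leading ` WHERE ` from the filter fragment and appending the score threshold as one more predicate.

```python
def merge_where(filter_query: str, extra: str = 'score > $min_score') -> str:
    """Combine an optional ' WHERE a AND b' fragment with one extra predicate.

    `merge_where` is a hypothetical name used for illustration only.
    """
    clauses = []
    if filter_query:
        # Drop only the first ' WHERE ' so the predicates can be re-joined.
        clauses.append(filter_query.replace(' WHERE ', '', 1).strip())
    clauses.append(extra)
    return ' WHERE ' + ' AND '.join(clauses)


no_filter = merge_where('')
with_filter = merge_where(' WHERE e.group_id IN $group_ids AND n.uuid = $source_uuid')
print(no_filter)
print(with_filter)
```

Building a single WHERE clause this way avoids emitting two consecutive WHERE keywords when both a search filter and a score threshold are present.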
|
||||||
|
|
||||||
|
async def edge_bfs_search(
|
||||||
|
self,
|
||||||
|
executor: QueryExecutor,
|
||||||
|
origin_uuids: list[str],
|
||||||
|
max_depth: int,
|
||||||
|
search_filter: SearchFilters,
|
||||||
|
group_ids: list[str] | None = None,
|
||||||
|
limit: int = 10,
|
||||||
|
) -> list[EntityEdge]:
|
||||||
|
if not origin_uuids:
|
||||||
|
return []
|
||||||
|
|
||||||
|
filter_queries, filter_params = edge_search_filter_query_constructor(
|
||||||
|
search_filter, GraphProvider.FALKORDB
|
||||||
|
)
|
||||||
|
|
||||||
|
if group_ids is not None:
|
||||||
|
filter_queries.append('e.group_id IN $group_ids')
|
||||||
|
filter_params['group_ids'] = group_ids
|
||||||
|
|
||||||
|
filter_query = ''
|
||||||
|
if filter_queries:
|
||||||
|
filter_query = ' WHERE ' + (' AND '.join(filter_queries))
|
||||||
|
|
||||||
|
cypher = (
|
||||||
|
f"""
|
||||||
|
UNWIND $bfs_origin_node_uuids AS origin_uuid
|
||||||
|
MATCH path = (origin {{uuid: origin_uuid}})-[:RELATES_TO|MENTIONS*1..{max_depth}]->(:Entity)
|
||||||
|
UNWIND relationships(path) AS rel
|
||||||
|
MATCH (n:Entity)-[e:RELATES_TO {{uuid: rel.uuid}}]-(m:Entity)
|
||||||
|
"""
|
||||||
|
+ filter_query
|
||||||
|
+ """
|
||||||
|
RETURN DISTINCT
|
||||||
|
"""
|
||||||
|
+ get_entity_edge_return_query(GraphProvider.FALKORDB)
|
||||||
|
+ """
|
||||||
|
LIMIT $limit
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
|
||||||
|
records, _, _ = await executor.execute_query(
|
||||||
|
cypher,
|
||||||
|
bfs_origin_node_uuids=origin_uuids,
|
||||||
|
depth=max_depth,
|
||||||
|
limit=limit,
|
||||||
|
**filter_params,
|
||||||
|
)
|
||||||
|
|
||||||
|
return [entity_edge_from_record(r) for r in records]
|
||||||
|
|
||||||
|
# --- Episode search ---
|
||||||
|
|
||||||
|
async def episode_fulltext_search(
|
||||||
|
self,
|
||||||
|
executor: QueryExecutor,
|
||||||
|
query: str,
|
||||||
|
search_filter: SearchFilters, # noqa: ARG002
|
||||||
|
group_ids: list[str] | None = None,
|
||||||
|
limit: int = 10,
|
||||||
|
) -> list[EpisodicNode]:
|
||||||
|
fuzzy_query = _build_falkor_fulltext_query(query, group_ids)
|
||||||
|
if fuzzy_query == '':
|
||||||
|
return []
|
||||||
|
|
||||||
|
filter_params: dict[str, Any] = {}
|
||||||
|
group_filter_query = ''
|
||||||
|
if group_ids is not None:
|
||||||
|
group_filter_query += '\nAND e.group_id IN $group_ids'
|
||||||
|
filter_params['group_ids'] = group_ids
|
||||||
|
|
||||||
|
cypher = (
|
||||||
|
get_nodes_query(
|
||||||
|
'episode_content', '$query', limit=limit, provider=GraphProvider.FALKORDB
|
||||||
|
)
|
||||||
|
+ """
|
||||||
|
YIELD node AS episode, score
|
||||||
|
MATCH (e:Episodic)
|
||||||
|
WHERE e.uuid = episode.uuid
|
||||||
|
"""
|
||||||
|
+ group_filter_query
|
||||||
|
+ """
|
||||||
|
RETURN
|
||||||
|
"""
|
||||||
|
+ EPISODIC_NODE_RETURN
|
||||||
|
+ """
|
||||||
|
ORDER BY score DESC
|
||||||
|
LIMIT $limit
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
|
||||||
|
records, _, _ = await executor.execute_query(
|
||||||
|
cypher, query=fuzzy_query, limit=limit, **filter_params
|
||||||
|
)
|
||||||
|
|
||||||
|
return [episodic_node_from_record(r) for r in records]
|
||||||
|
|
||||||
|
# --- Community search ---
|
||||||
|
|
||||||
|
async def community_fulltext_search(
|
||||||
|
self,
|
||||||
|
executor: QueryExecutor,
|
||||||
|
query: str,
|
||||||
|
group_ids: list[str] | None = None,
|
||||||
|
limit: int = 10,
|
||||||
|
) -> list[CommunityNode]:
|
||||||
|
fuzzy_query = _build_falkor_fulltext_query(query, group_ids)
|
||||||
|
if fuzzy_query == '':
|
||||||
|
return []
|
||||||
|
|
||||||
|
filter_params: dict[str, Any] = {}
|
||||||
|
group_filter_query = ''
|
||||||
|
if group_ids is not None:
|
||||||
|
group_filter_query = 'WHERE c.group_id IN $group_ids'
|
||||||
|
filter_params['group_ids'] = group_ids
|
||||||
|
|
||||||
|
cypher = (
|
||||||
|
get_nodes_query(
|
||||||
|
'community_name', '$query', limit=limit, provider=GraphProvider.FALKORDB
|
||||||
|
)
|
||||||
|
+ """
|
||||||
|
YIELD node AS c, score
|
||||||
|
WITH c, score
|
||||||
|
"""
|
||||||
|
+ group_filter_query
|
||||||
|
+ """
|
||||||
|
RETURN
|
||||||
|
"""
|
||||||
|
+ COMMUNITY_NODE_RETURN
|
||||||
|
+ """
|
||||||
|
ORDER BY score DESC
|
||||||
|
LIMIT $limit
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
|
||||||
|
records, _, _ = await executor.execute_query(
|
||||||
|
cypher, query=fuzzy_query, limit=limit, **filter_params
|
||||||
|
)
|
||||||
|
|
||||||
|
return [community_node_from_record(r) for r in records]
|
||||||
|
|
||||||
|
async def community_similarity_search(
|
||||||
|
self,
|
||||||
|
executor: QueryExecutor,
|
||||||
|
search_vector: list[float],
|
||||||
|
group_ids: list[str] | None = None,
|
||||||
|
limit: int = 10,
|
||||||
|
min_score: float = 0.6,
|
||||||
|
) -> list[CommunityNode]:
|
||||||
|
query_params: dict[str, Any] = {}
|
||||||
|
|
||||||
|
group_filter_query = ''
|
||||||
|
if group_ids is not None:
|
||||||
|
group_filter_query += ' WHERE c.group_id IN $group_ids'
|
||||||
|
query_params['group_ids'] = group_ids
|
||||||
|
|
||||||
|
# PATCHED 2026-05-02 (BirdAI vendored patch): use FalkorDB native vector
|
||||||
|
# index on Community.name_embedding when available. Note: the existing
|
||||||
|
# filter is built into `group_filter_query` (already prefixed with
|
||||||
|
# ' WHERE ' if non-empty) and uses variable `c`. The dispatcher binds
|
||||||
|
# the node as `n` for parity with the helper signature, then we
|
||||||
|
# re-bind to `c` via WITH so the rest of the query is unchanged.
|
||||||
|
use_index = await _falkordb_vector_index_exists(
|
||||||
|
executor, 'Community', 'name_embedding', 'NODE'
|
||||||
|
)
|
||||||
|
prefix, _ = _falkordb_vector_node_search_cypher(
|
||||||
|
'Community', 'name_embedding', '$search_vector', use_index
|
||||||
|
)
|
||||||
|
prefix = prefix + ' WITH n AS c, score '
|
||||||
|
where_clauses = []
|
||||||
|
if group_filter_query:
|
||||||
|
where_clauses.append(group_filter_query.replace(' WHERE ', '', 1).strip())
|
||||||
|
where_clauses.append('score > $min_score')
|
||||||
|
unified_where = ' WHERE ' + ' AND '.join(where_clauses)
|
||||||
|
|
||||||
|
cypher = (
|
||||||
|
prefix
|
||||||
|
+ unified_where
|
||||||
|
+ """
|
||||||
|
RETURN
|
||||||
|
"""
|
||||||
|
+ COMMUNITY_NODE_RETURN
|
||||||
|
+ """
|
||||||
|
ORDER BY score DESC
|
||||||
|
LIMIT $limit
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
params = dict(
|
||||||
|
search_vector=search_vector,
|
||||||
|
limit=limit,
|
||||||
|
min_score=min_score,
|
||||||
|
**query_params,
|
||||||
|
)
|
||||||
|
if use_index:
|
||||||
|
params['candidate_k'] = limit * VECTOR_INDEX_CANDIDATE_MULTIPLIER
|
||||||
|
records, _, _ = await executor.execute_query(cypher, **params)
|
||||||
|
|
||||||
|
return [community_node_from_record(r) for r in records]
|
||||||
|
|
||||||
|
# --- Rerankers ---
|
||||||
|
|
||||||
|
async def node_distance_reranker(
|
||||||
|
self,
|
||||||
|
executor: QueryExecutor,
|
||||||
|
node_uuids: list[str],
|
||||||
|
center_node_uuid: str,
|
||||||
|
min_score: float = 0,
|
||||||
|
) -> list[EntityNode]:
|
||||||
|
filtered_uuids = [u for u in node_uuids if u != center_node_uuid]
|
||||||
|
scores: dict[str, float] = {center_node_uuid: 0.0}
|
||||||
|
|
||||||
|
cypher = """
|
||||||
|
UNWIND $node_uuids AS node_uuid
|
||||||
|
MATCH (center:Entity {uuid: $center_uuid})-[:RELATES_TO]-(n:Entity {uuid: node_uuid})
|
||||||
|
RETURN 1 AS score, node_uuid AS uuid
|
||||||
|
"""
|
||||||
|
|
||||||
|
results, _, _ = await executor.execute_query(
|
||||||
|
cypher,
|
||||||
|
node_uuids=filtered_uuids,
|
||||||
|
center_uuid=center_node_uuid,
|
||||||
|
)
|
||||||
|
|
||||||
|
for result in results:
|
||||||
|
scores[result['uuid']] = result['score']
|
||||||
|
|
||||||
|
for uuid in filtered_uuids:
|
||||||
|
if uuid not in scores:
|
||||||
|
scores[uuid] = float('inf')
|
||||||
|
|
||||||
|
filtered_uuids.sort(key=lambda cur_uuid: scores[cur_uuid])
|
||||||
|
|
||||||
|
if center_node_uuid in node_uuids:
|
||||||
|
scores[center_node_uuid] = 0.1
|
||||||
|
filtered_uuids = [center_node_uuid] + filtered_uuids
|
||||||
|
|
||||||
|
reranked_uuids = [u for u in filtered_uuids if (1 / scores[u]) >= min_score]
|
||||||
|
|
||||||
|
if not reranked_uuids:
|
||||||
|
return []
|
||||||
|
|
||||||
|
get_query = """
|
||||||
|
MATCH (n:Entity)
|
||||||
|
WHERE n.uuid IN $uuids
|
||||||
|
RETURN
|
||||||
|
""" + get_entity_node_return_query(GraphProvider.FALKORDB)
|
||||||
|
|
||||||
|
records, _, _ = await executor.execute_query(get_query, uuids=reranked_uuids)
|
||||||
|
|
||||||
|
node_map = {r['uuid']: entity_node_from_record(r) for r in records}
|
||||||
|
return [node_map[u] for u in reranked_uuids if u in node_map]
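
The ordering logic of `node_distance_reranker` does not depend on the database once the direct neighbours of the center node are known. A minimal sketch under that assumption follows; the function name `rerank_by_distance` and the `neighbours` argument are hypothetical stand-ins for the Cypher round trip above.

```python
def rerank_by_distance(node_uuids, center_uuid, neighbours, min_score=0.0):
    """Order uuids by distance to the center node.

    `neighbours` is the set of uuids directly connected to the center;
    in the real method this comes from the RELATES_TO query.
    """
    filtered = [u for u in node_uuids if u != center_uuid]
    # Direct neighbours get distance 1, everything else is unreachable (inf).
    scores = {u: (1.0 if u in neighbours else float('inf')) for u in filtered}
    filtered.sort(key=lambda u: scores[u])

    if center_uuid in node_uuids:
        # The center itself is ranked first with a small non-zero distance,
        # so that 1 / distance stays finite.
        scores[center_uuid] = 0.1
        filtered = [center_uuid] + filtered

    # 1 / inf evaluates to 0.0, so unreachable nodes survive only if min_score <= 0.
    return [u for u in filtered if (1 / scores[u]) >= min_score]


ranked_all = rerank_by_distance(['a', 'b', 'c'], 'b', {'c'})
ranked_near = rerank_by_distance(['a', 'b', 'c'], 'b', {'c'}, min_score=0.5)
```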
|
||||||
|
|
||||||
|
async def episode_mentions_reranker(
|
||||||
|
self,
|
||||||
|
executor: QueryExecutor,
|
||||||
|
node_uuids: list[str],
|
||||||
|
min_score: float = 0,
|
||||||
|
) -> list[EntityNode]:
|
||||||
|
if not node_uuids:
|
||||||
|
return []
|
||||||
|
|
||||||
|
scores: dict[str, float] = {}
|
||||||
|
|
||||||
|
results, _, _ = await executor.execute_query(
|
||||||
|
"""
|
||||||
|
UNWIND $node_uuids AS node_uuid
|
||||||
|
MATCH (episode:Episodic)-[r:MENTIONS]->(n:Entity {uuid: node_uuid})
|
||||||
|
RETURN count(*) AS score, n.uuid AS uuid
|
||||||
|
""",
|
||||||
|
node_uuids=node_uuids,
|
||||||
|
)
|
||||||
|
|
||||||
|
for result in results:
|
||||||
|
scores[result['uuid']] = result['score']
|
||||||
|
|
||||||
|
for uuid in node_uuids:
|
||||||
|
if uuid not in scores:
|
||||||
|
scores[uuid] = float('inf')
|
||||||
|
|
||||||
|
sorted_uuids = list(node_uuids)
|
||||||
|
sorted_uuids.sort(key=lambda cur_uuid: scores[cur_uuid])
|
||||||
|
|
||||||
|
reranked_uuids = [u for u in sorted_uuids if scores[u] >= min_score]
|
||||||
|
|
||||||
|
if not reranked_uuids:
|
||||||
|
return []
|
||||||
|
|
||||||
|
get_query = """
|
||||||
|
MATCH (n:Entity)
|
||||||
|
WHERE n.uuid IN $uuids
|
||||||
|
RETURN
|
||||||
|
""" + get_entity_node_return_query(GraphProvider.FALKORDB)
|
||||||
|
|
||||||
|
records, _, _ = await executor.execute_query(get_query, uuids=reranked_uuids)
|
||||||
|
|
||||||
|
node_map = {r['uuid']: entity_node_from_record(r) for r in records}
|
||||||
|
return [node_map[u] for u in reranked_uuids if u in node_map]
|
||||||
|
|
||||||
|
# --- Filter builders ---
|
||||||
|
|
||||||
|
def build_node_search_filters(self, search_filters: SearchFilters) -> Any:
|
||||||
|
filter_queries, filter_params = node_search_filter_query_constructor(
|
||||||
|
search_filters, GraphProvider.FALKORDB
|
||||||
|
)
|
||||||
|
return {'filter_queries': filter_queries, 'filter_params': filter_params}
|
||||||
|
|
||||||
|
def build_edge_search_filters(self, search_filters: SearchFilters) -> Any:
|
||||||
|
filter_queries, filter_params = edge_search_filter_query_constructor(
|
||||||
|
search_filters, GraphProvider.FALKORDB
|
||||||
|
)
|
||||||
|
return {'filter_queries': filter_queries, 'filter_params': filter_params}
|
||||||
|
|
||||||
|
# --- Fulltext query builder ---
|
||||||
|
|
||||||
|
def build_fulltext_query(
|
||||||
|
self,
|
||||||
|
query: str,
|
||||||
|
group_ids: list[str] | None = None,
|
||||||
|
max_query_length: int = MAX_QUERY_LENGTH,
|
||||||
|
) -> str:
|
||||||
|
return _build_falkor_fulltext_query(query, group_ids, max_query_length)
|
||||||
@@ -0,0 +1,444 @@
|
|||||||
|
"""
|
||||||
|
Copyright 2024, Zep Software, Inc.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
you may not use this file except in compliance with the License.
|
||||||
|
You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software
|
||||||
|
distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
See the License for the specific language governing permissions and
|
||||||
|
limitations under the License.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import datetime
|
||||||
|
import logging
|
||||||
|
from typing import TYPE_CHECKING, Any
|
||||||
|
|
||||||
|
if TYPE_CHECKING:
|
||||||
|
from falkordb import Graph as FalkorGraph
|
||||||
|
from falkordb.asyncio import FalkorDB
|
||||||
|
else:
|
||||||
|
try:
|
||||||
|
from falkordb import Graph as FalkorGraph
|
||||||
|
from falkordb.asyncio import FalkorDB
|
||||||
|
except ImportError:
|
||||||
|
# If falkordb is not installed, raise an ImportError
|
||||||
|
raise ImportError(
|
||||||
|
'falkordb is required for FalkorDriver. '
|
||||||
|
'Install it with: pip install graphiti-core[falkordb]'
|
||||||
|
) from None
|
||||||
|
|
||||||
|
from graphiti_core.driver.driver import GraphDriver, GraphDriverSession, GraphProvider
|
||||||
|
from graphiti_core.driver.falkordb import STOPWORDS as STOPWORDS
|
||||||
|
from graphiti_core.driver.falkordb.operations.community_edge_ops import (
|
||||||
|
FalkorCommunityEdgeOperations,
|
||||||
|
)
|
||||||
|
from graphiti_core.driver.falkordb.operations.community_node_ops import (
|
||||||
|
FalkorCommunityNodeOperations,
|
||||||
|
)
|
||||||
|
from graphiti_core.driver.falkordb.operations.entity_edge_ops import FalkorEntityEdgeOperations
|
||||||
|
from graphiti_core.driver.falkordb.operations.entity_node_ops import FalkorEntityNodeOperations
|
||||||
|
from graphiti_core.driver.falkordb.operations.episode_node_ops import FalkorEpisodeNodeOperations
|
||||||
|
from graphiti_core.driver.falkordb.operations.episodic_edge_ops import FalkorEpisodicEdgeOperations
|
||||||
|
from graphiti_core.driver.falkordb.operations.graph_ops import FalkorGraphMaintenanceOperations
|
||||||
|
from graphiti_core.driver.falkordb.operations.has_episode_edge_ops import (
|
||||||
|
FalkorHasEpisodeEdgeOperations,
|
||||||
|
)
|
||||||
|
from graphiti_core.driver.falkordb.operations.next_episode_edge_ops import (
|
||||||
|
FalkorNextEpisodeEdgeOperations,
|
||||||
|
)
|
||||||
|
from graphiti_core.driver.falkordb.operations.saga_node_ops import FalkorSagaNodeOperations
|
||||||
|
from graphiti_core.driver.falkordb.operations.search_ops import FalkorSearchOperations
|
||||||
|
from graphiti_core.driver.operations.community_edge_ops import CommunityEdgeOperations
|
||||||
|
from graphiti_core.driver.operations.community_node_ops import CommunityNodeOperations
|
||||||
|
from graphiti_core.driver.operations.entity_edge_ops import EntityEdgeOperations
|
||||||
|
from graphiti_core.driver.operations.entity_node_ops import EntityNodeOperations
|
||||||
|
from graphiti_core.driver.operations.episode_node_ops import EpisodeNodeOperations
|
||||||
|
from graphiti_core.driver.operations.episodic_edge_ops import EpisodicEdgeOperations
|
||||||
|
from graphiti_core.driver.operations.graph_ops import GraphMaintenanceOperations
|
||||||
|
from graphiti_core.driver.operations.has_episode_edge_ops import HasEpisodeEdgeOperations
|
||||||
|
from graphiti_core.driver.operations.next_episode_edge_ops import NextEpisodeEdgeOperations
|
||||||
|
from graphiti_core.driver.operations.saga_node_ops import SagaNodeOperations
|
||||||
|
from graphiti_core.driver.operations.search_ops import SearchOperations
|
||||||
|
from graphiti_core.graph_queries import get_fulltext_indices, get_range_indices, get_vector_indices
|
||||||
|
from graphiti_core.helpers import validate_group_ids
|
||||||
|
from graphiti_core.utils.datetime_utils import convert_datetimes_to_strings
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class FalkorDriverSession(GraphDriverSession):
|
||||||
|
provider = GraphProvider.FALKORDB
|
||||||
|
|
||||||
|
def __init__(self, graph: FalkorGraph):
|
||||||
|
self.graph = graph
|
||||||
|
|
||||||
|
async def __aenter__(self):
|
||||||
|
return self
|
||||||
|
|
||||||
|
async def __aexit__(self, exc_type, exc, tb):
|
||||||
|
# No cleanup needed for Falkor, but method must exist
|
||||||
|
pass
|
||||||
|
|
||||||
|
async def close(self):
|
||||||
|
# No explicit close needed for FalkorDB, but method must exist
|
||||||
|
pass
|
||||||
|
|
||||||
|
async def execute_write(self, func, *args, **kwargs):
|
||||||
|
# Directly await the provided async function with `self` as the transaction/session
|
||||||
|
return await func(self, *args, **kwargs)
|
||||||
|
|
||||||
|
async def run(self, query: str | list, **kwargs: Any) -> Any:
|
||||||
|
# FalkorDB cannot take a label set as a query parameter, so such writes arrive as a list of (cypher, params) pairs
|
||||||
|
if isinstance(query, list):
|
||||||
|
for cypher, params in query:
|
||||||
|
params = convert_datetimes_to_strings(params)
|
||||||
|
await self.graph.query(str(cypher), params) # type: ignore[reportUnknownArgumentType]
|
||||||
|
else:
|
||||||
|
params = dict(kwargs)
|
||||||
|
params = convert_datetimes_to_strings(params)
|
||||||
|
await self.graph.query(str(query), params) # type: ignore[reportUnknownArgumentType]
|
||||||
|
# Assuming `graph.query` is async (ideal); otherwise, wrap in executor
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
class FalkorDriver(GraphDriver):
|
||||||
|
provider = GraphProvider.FALKORDB
|
||||||
|
default_group_id: str = '\\_'
|
||||||
|
fulltext_syntax: str = '@' # FalkorDB uses a redisearch-like syntax for fulltext queries
|
||||||
|
aoss_client: None = None
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
host: str = 'localhost',
|
||||||
|
port: int = 6379,
|
||||||
|
username: str | None = None,
|
||||||
|
password: str | None = None,
|
||||||
|
falkor_db: FalkorDB | None = None,
|
||||||
|
database: str = 'default_db',
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Initialize the FalkorDB driver.
|
||||||
|
|
||||||
|
FalkorDB is a multi-tenant graph database.
|
||||||
|
To connect, provide the host and port.
|
||||||
|
The default parameters assume a local (on-premises) FalkorDB instance.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
host (str): The host where FalkorDB is running.
|
||||||
|
port (int): The port on which FalkorDB is listening.
|
||||||
|
username (str | None): The username for authentication (if required).
|
||||||
|
password (str | None): The password for authentication (if required).
|
||||||
|
falkor_db (FalkorDB | None): An existing FalkorDB instance to use instead of creating a new one.
|
||||||
|
database (str): The name of the database to connect to. Defaults to 'default_db'.
|
||||||
|
"""
|
||||||
|
super().__init__()
|
||||||
|
self._database = database
|
||||||
|
if falkor_db is not None:
|
||||||
|
# If a FalkorDB instance is provided, use it directly
|
||||||
|
self.client = falkor_db
|
||||||
|
else:
|
||||||
|
self.client = FalkorDB(host=host, port=port, username=username, password=password)
|
||||||
|
|
||||||
|
# Instantiate FalkorDB operations
|
||||||
|
self._entity_node_ops = FalkorEntityNodeOperations()
|
||||||
|
self._episode_node_ops = FalkorEpisodeNodeOperations()
|
||||||
|
self._community_node_ops = FalkorCommunityNodeOperations()
|
||||||
|
self._saga_node_ops = FalkorSagaNodeOperations()
|
||||||
|
self._entity_edge_ops = FalkorEntityEdgeOperations()
|
||||||
|
self._episodic_edge_ops = FalkorEpisodicEdgeOperations()
|
||||||
|
self._community_edge_ops = FalkorCommunityEdgeOperations()
|
||||||
|
self._has_episode_edge_ops = FalkorHasEpisodeEdgeOperations()
|
||||||
|
self._next_episode_edge_ops = FalkorNextEpisodeEdgeOperations()
|
||||||
|
self._search_ops = FalkorSearchOperations()
|
||||||
|
self._graph_ops = FalkorGraphMaintenanceOperations()
|
||||||
|
|
||||||
|
# Schedule the indices and constraints to be built
|
||||||
|
try:
|
||||||
|
# Try to get the current event loop
|
||||||
|
loop = asyncio.get_running_loop()
|
||||||
|
# Schedule the build_indices_and_constraints to run
|
||||||
|
loop.create_task(self.build_indices_and_constraints())
|
||||||
|
except RuntimeError:
|
||||||
|
# No event loop running, this will be handled later
|
||||||
|
pass
|
||||||
|
|
||||||
|
# --- Operations properties ---
|
||||||
|
|
||||||
|
@property
|
||||||
|
def entity_node_ops(self) -> EntityNodeOperations:
|
||||||
|
return self._entity_node_ops
|
||||||
|
|
||||||
|
@property
|
||||||
|
def episode_node_ops(self) -> EpisodeNodeOperations:
|
||||||
|
return self._episode_node_ops
|
||||||
|
|
||||||
|
@property
|
||||||
|
def community_node_ops(self) -> CommunityNodeOperations:
|
||||||
|
return self._community_node_ops
|
||||||
|
|
||||||
|
@property
|
||||||
|
def saga_node_ops(self) -> SagaNodeOperations:
|
||||||
|
return self._saga_node_ops
|
||||||
|
|
||||||
|
@property
|
||||||
|
def entity_edge_ops(self) -> EntityEdgeOperations:
|
||||||
|
return self._entity_edge_ops
|
||||||
|
|
||||||
|
@property
|
||||||
|
def episodic_edge_ops(self) -> EpisodicEdgeOperations:
|
||||||
|
return self._episodic_edge_ops
|
||||||
|
|
||||||
|
@property
|
||||||
|
def community_edge_ops(self) -> CommunityEdgeOperations:
|
||||||
|
return self._community_edge_ops
|
||||||
|
|
||||||
|
@property
|
||||||
|
def has_episode_edge_ops(self) -> HasEpisodeEdgeOperations:
|
||||||
|
return self._has_episode_edge_ops
|
||||||
|
|
||||||
|
@property
|
||||||
|
def next_episode_edge_ops(self) -> NextEpisodeEdgeOperations:
|
||||||
|
return self._next_episode_edge_ops
|
||||||
|
|
||||||
|
@property
|
||||||
|
def search_ops(self) -> SearchOperations:
|
||||||
|
return self._search_ops
|
||||||
|
|
||||||
|
@property
|
||||||
|
def graph_ops(self) -> GraphMaintenanceOperations:
|
||||||
|
return self._graph_ops
|
||||||
|
|
||||||
|
def _get_graph(self, graph_name: str | None) -> FalkorGraph:
|
||||||
|
# FalkorDB requires a non-None database name for multi-tenant graphs; the default is "default_db"
|
||||||
|
if graph_name is None:
|
||||||
|
graph_name = self._database
|
||||||
|
return self.client.select_graph(graph_name)
|
||||||
|
|
||||||
|
async def execute_query(self, cypher_query_, **kwargs: Any):
|
||||||
|
graph = self._get_graph(self._database)
|
||||||
|
|
||||||
|
# Convert datetime objects to ISO strings (FalkorDB does not support datetime objects directly)
|
||||||
|
params = convert_datetimes_to_strings(dict(kwargs))
|
||||||
|
|
||||||
|
try:
|
||||||
|
result = await graph.query(cypher_query_, params) # type: ignore[reportUnknownArgumentType]
|
||||||
|
except Exception as e:
|
||||||
|
if 'already indexed' in str(e):
|
||||||
|
# check if index already exists
|
||||||
|
logger.info(f'Index already exists: {e}')
|
||||||
|
return None
|
||||||
|
logger.error(f'Error executing FalkorDB query: {e}\n{cypher_query_}\n{params}')
|
||||||
|
raise
|
||||||
|
|
||||||
|
# Convert the result header to a list of strings
|
||||||
|
header = [h[1] for h in result.header]
|
||||||
|
|
||||||
|
# Convert FalkorDB's result format (list of lists) to the format expected by Graphiti (list of dicts)
|
||||||
|
records = []
|
||||||
|
for row in result.result_set:
|
||||||
|
record = {}
|
||||||
|
for i, field_name in enumerate(header):
|
||||||
|
if i < len(row):
|
||||||
|
record[field_name] = row[i]
|
||||||
|
else:
|
||||||
|
# If there are more fields in header than values in row, set to None
|
||||||
|
record[field_name] = None
|
||||||
|
records.append(record)
|
||||||
|
|
||||||
|
return records, header, None
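
The row-to-record conversion at the end of `execute_query` can be sketched on its own. This assumes, as the code above does, that each header entry is a pair whose second element is the column name; `rows_to_records` is a hypothetical name used only for this illustration.

```python
def rows_to_records(header, result_set):
    """Turn a list-of-lists result into a list of dicts keyed by column name."""
    names = [h[1] for h in header]
    return [
        # Pad short rows with None, matching the behaviour of execute_query.
        {name: (row[i] if i < len(row) else None) for i, name in enumerate(names)}
        for row in result_set
    ]


records = rows_to_records(
    header=[[1, 'uuid'], [1, 'score']],
    result_set=[['a', 0.9], ['b']],
)
```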
|
||||||
|
|
||||||
|
def session(self, database: str | None = None) -> GraphDriverSession:
|
||||||
|
return FalkorDriverSession(self._get_graph(database))
|
||||||
|
|
||||||
|
async def close(self) -> None:
|
||||||
|
"""Close the driver connection."""
|
||||||
|
if hasattr(self.client, 'aclose'):
|
||||||
|
await self.client.aclose() # type: ignore[reportUnknownMemberType]
|
||||||
|
elif hasattr(self.client.connection, 'aclose'):
|
||||||
|
await self.client.connection.aclose()
|
||||||
|
elif hasattr(self.client.connection, 'close'):
|
||||||
|
await self.client.connection.close()
|
||||||
|
|
||||||
|
async def delete_all_indexes(self) -> None:
|
||||||
|
result = await self.execute_query('CALL db.indexes()')
|
||||||
|
if not result:
|
||||||
|
return
|
||||||
|
|
||||||
|
records, _, _ = result
|
||||||
|
drop_tasks = []
|
||||||
|
|
||||||
|
for record in records:
|
||||||
|
label = record['label']
|
||||||
|
entity_type = record['entitytype']
|
||||||
|
|
||||||
|
for field_name, index_type in record['types'].items():
|
||||||
|
if 'RANGE' in index_type:
|
||||||
|
drop_tasks.append(self.execute_query(f'DROP INDEX ON :{label}({field_name})'))
|
||||||
|
elif 'FULLTEXT' in index_type:
|
||||||
|
if entity_type == 'NODE':
|
||||||
|
drop_tasks.append(
|
||||||
|
self.execute_query(
|
||||||
|
f'DROP FULLTEXT INDEX FOR (n:{label}) ON (n.{field_name})'
|
||||||
|
)
|
||||||
|
)
|
||||||
|
elif entity_type == 'RELATIONSHIP':
|
||||||
|
drop_tasks.append(
|
||||||
|
self.execute_query(
|
||||||
|
f'DROP FULLTEXT INDEX FOR ()-[e:{label}]-() ON (e.{field_name})'
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
if drop_tasks:
|
||||||
|
await asyncio.gather(*drop_tasks)
|
||||||
|
|
||||||
|
async def build_indices_and_constraints(self, delete_existing=False):
|
||||||
|
if delete_existing:
|
||||||
|
await self.delete_all_indexes()
|
||||||
|
# PATCHED 2026-05-02 (BirdAI vendored patch): add vector indexes alongside
|
||||||
|
# range and fulltext. FalkorDB supports native vector indexes via
|
||||||
|
# db.idx.vector.queryNodes / queryRelationships; without these, similarity
|
||||||
|
# search runs as full-table-scan cosine math in interpreted Cypher.
|
||||||
|
index_queries = (
|
||||||
|
get_range_indices(self.provider)
|
||||||
|
+ get_fulltext_indices(self.provider)
|
||||||
|
+ get_vector_indices(self.provider)
|
||||||
|
)
|
||||||
|
for query in index_queries:
|
||||||
|
await self.execute_query(query)
|
||||||
|
# Invalidate the search_ops vector-index existence cache so subsequent
|
||||||
|
# similarity queries re-probe and discover the indexes we just built.
|
||||||
|
try:
|
||||||
|
from graphiti_core.driver.falkordb.operations.search_ops import (
|
||||||
|
_invalidate_falkordb_vector_index_cache,
|
||||||
|
)
|
||||||
|
_invalidate_falkordb_vector_index_cache()
|
||||||
|
except ImportError:
|
||||||
|
# Invalidation helper unavailable (e.g. an unpatched search_ops module);
|
||||||
|
# in that case there is no cache to invalidate.
|
||||||
|
pass
|
||||||
|
|
||||||
|
def clone(self, database: str) -> 'GraphDriver':
|
||||||
|
"""
|
||||||
|
Returns a shallow copy of this driver with a different default database.
|
||||||
|
Reuses the same connection (e.g. FalkorDB, Neo4j).
|
||||||
|
"""
|
||||||
|
if database == self._database:
|
||||||
|
cloned = self
|
||||||
|
elif database == self.default_group_id:
|
||||||
|
cloned = FalkorDriver(falkor_db=self.client)
|
||||||
|
else:
|
||||||
|
# Create a new instance of FalkorDriver with the same connection but a different database
|
||||||
|
cloned = FalkorDriver(falkor_db=self.client, database=database)
|
||||||
|
|
||||||
|
return cloned
|
||||||
|
|
||||||
|
async def health_check(self) -> None:
|
||||||
|
"""Check FalkorDB connectivity by running a simple query."""
|
||||||
|
try:
|
||||||
|
await self.execute_query('MATCH (n) RETURN 1 LIMIT 1')
|
||||||
|
return None
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f'FalkorDB health check failed: {e}')
|
||||||
|
raise
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def convert_datetimes_to_strings(obj):
|
||||||
|
if isinstance(obj, dict):
|
||||||
|
return {k: FalkorDriver.convert_datetimes_to_strings(v) for k, v in obj.items()}
|
||||||
|
elif isinstance(obj, list):
|
||||||
|
return [FalkorDriver.convert_datetimes_to_strings(item) for item in obj]
|
||||||
|
elif isinstance(obj, tuple):
|
||||||
|
return tuple(FalkorDriver.convert_datetimes_to_strings(item) for item in obj)
|
||||||
|
elif isinstance(obj, datetime.datetime):
|
||||||
|
return obj.isoformat()
|
||||||
|
else:
|
||||||
|
return obj
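
A standalone sketch of the same recursive conversion is shown below. Note that `isinstance` needs the class `datetime.datetime` (imported here by name), not the `datetime` module. The function name `to_iso` is hypothetical.

```python
from datetime import datetime, timezone


def to_iso(obj):
    """Recursively replace datetime values with ISO 8601 strings."""
    if isinstance(obj, dict):
        return {k: to_iso(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_iso(item) for item in obj]
    if isinstance(obj, tuple):
        return tuple(to_iso(item) for item in obj)
    if isinstance(obj, datetime):
        return obj.isoformat()
    return obj


params = to_iso({
    'created_at': datetime(2026, 5, 2, tzinfo=timezone.utc),
    'history': [datetime(2026, 1, 1)],
    'limit': 10,
})
```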
|
||||||
|
|
||||||
|
def sanitize(self, query: str) -> str:
|
||||||
|
"""
|
||||||
|
Replace FalkorDB special characters with whitespace.
|
||||||
|
Based on FalkorDB tokenization rules: ,.<>{}[]"':;!@#$%^&*()-+=~
|
||||||
|
"""
|
||||||
|
# FalkorDB separator characters that break text into tokens
separator_map = str.maketrans(
    {
        ',': ' ',
        '.': ' ',
        '<': ' ',
        '>': ' ',
        '{': ' ',
        '}': ' ',
        '[': ' ',
        ']': ' ',
        '"': ' ',
        "'": ' ',
        ':': ' ',
        ';': ' ',
        '!': ' ',
        '@': ' ',
        '#': ' ',
        '$': ' ',
        '%': ' ',
        '^': ' ',
        '&': ' ',
        '*': ' ',
        '(': ' ',
        ')': ' ',
        '-': ' ',
        '+': ' ',
        '=': ' ',
        '~': ' ',
        '?': ' ',
        '|': ' ',
        '/': ' ',
        '\\': ' ',
    }
)
sanitized = query.translate(separator_map)
# Clean up multiple spaces
sanitized = ' '.join(sanitized.split())
return sanitized
|
||||||
|
|
||||||
|
def build_fulltext_query(
|
||||||
|
self, query: str, group_ids: list[str] | None = None, max_query_length: int = 128
|
||||||
|
) -> str:
|
||||||
|
"""
|
||||||
|
Build a fulltext query string for FalkorDB using RediSearch syntax.
FalkorDB uses RediSearch-like syntax where:
- Field queries use @ prefix: @field:value
- Multiple values for same field: (@field:value1|value2)
- Text search doesn't need @ prefix for content fields
- AND is implicit with space: (@group_id:value) (text)
- OR uses pipe within parentheses: (@group_id:value1|value2)
|
||||||
|
"""
|
||||||
|
validate_group_ids(group_ids)
|
||||||
|
|
||||||
|
if group_ids is None or len(group_ids) == 0:
|
||||||
|
group_filter = ''
|
||||||
|
else:
|
||||||
|
# Escape group_ids with quotes to prevent RediSearch syntax errors
|
||||||
|
# with reserved words like "main" or special characters like hyphens
|
||||||
|
escaped_group_ids = [f'"{gid}"' for gid in group_ids]
|
||||||
|
group_values = '|'.join(escaped_group_ids)
|
||||||
|
group_filter = f'(@group_id:{group_values})'
|
||||||
|
|
||||||
|
sanitized_query = self.sanitize(query)
|
||||||
|
|
||||||
|
# Remove stopwords and empty tokens from the sanitized query
|
||||||
|
query_words = sanitized_query.split()
|
||||||
|
filtered_words = [word for word in query_words if word and word.lower() not in STOPWORDS]
|
||||||
|
sanitized_query = ' | '.join(filtered_words)
|
||||||
|
|
||||||
|
# If the query is too long return no query
|
||||||
|
if len(sanitized_query.split(' ')) + len(group_ids or '') >= max_query_length:
|
||||||
|
return ''
|
||||||
|
|
||||||
|
full_query = group_filter + ' (' + sanitized_query + ')'
|
||||||
|
|
||||||
|
return full_query
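To make the string that `build_fulltext_query` emits concrete, the following is a minimal standalone sketch of the same sanitize, stopword-filter, and OR-join steps. `STOPWORDS_SAMPLE` is a hypothetical stand-in for the driver's real `STOPWORDS` list, and the function names are illustrative; they are not part of the driver.

```python
# Standalone sketch of the sanitize -> stopword filter -> OR-join pipeline.
# STOPWORDS_SAMPLE is a hypothetical stand-in for the driver's STOPWORDS.
SEPARATORS = ',.<>{}[]"\':;!@#$%^&*()-+=~?|/\\'
SEPARATOR_MAP = str.maketrans({ch: ' ' for ch in SEPARATORS})
STOPWORDS_SAMPLE = {'the', 'a', 'an', 'of', 'for'}


def sanitize_sketch(query: str) -> str:
    # Replace every separator with a space, then collapse runs of whitespace.
    return ' '.join(query.translate(SEPARATOR_MAP).split())


def fulltext_query_sketch(query, group_ids=None, max_query_length=128):
    group_filter = ''
    if group_ids:
        # Quote each group id so reserved words and hyphens stay literal.
        quoted = '|'.join(f'"{gid}"' for gid in group_ids)
        group_filter = f'(@group_id:{quoted})'
    words = [w for w in sanitize_sketch(query).split() if w.lower() not in STOPWORDS_SAMPLE]
    joined = ' | '.join(words)
    # Same length guard as the method: token count (pipes included) plus group count.
    if len(joined.split(' ')) + len(group_ids or []) >= max_query_length:
        return ''
    return group_filter + ' (' + joined + ')'


example = fulltext_query_sketch('the FWN3D-consulting work!', ['main'])
print(example)
```

Note that a query made up only of stopwords still yields an empty parenthesized group under this logic, so callers may want to check for that case before sending the string to the index.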
|
||||||
@@ -0,0 +1,242 @@
|
|||||||
|
"""
|
||||||
|
Database query utilities for different graph database backends.
|
||||||
|
|
||||||
|
This module provides database-agnostic query generation for Neo4j and FalkorDB,
|
||||||
|
supporting index creation, fulltext search, and bulk operations.
|
||||||
|
|
||||||
|
PATCHED for FalkorDB native vector index support (BirdAI vendored patch,
|
||||||
|
2026-05-02). Adds:
|
||||||
|
- get_vector_indices(): CREATE VECTOR INDEX statements for FalkorDB
|
||||||
|
- get_vector_search_query(): Cypher fragment for vector similarity using
|
||||||
|
FalkorDB's db.idx.vector procedures, with fallback to cosine math when
|
||||||
|
the index does not yet exist
|
||||||
|
- VECTOR_INDEX_CANDIDATE_MULTIPLIER: over-fetch factor for vector index
|
||||||
|
queries to handle filter rejections after index lookup
|
||||||
|
|
||||||
|
No changes to Neo4j or Kuzu code paths.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from typing_extensions import LiteralString
|
||||||
|
|
||||||
|
from graphiti_core.driver.driver import GraphProvider
|
||||||
|
|
||||||
|
# Mapping from Neo4j fulltext index names to FalkorDB node labels
|
||||||
|
NEO4J_TO_FALKORDB_MAPPING = {
|
||||||
|
'node_name_and_summary': 'Entity',
|
||||||
|
'community_name': 'Community',
|
||||||
|
'episode_content': 'Episodic',
|
||||||
|
'edge_name_and_fact': 'RELATES_TO',
|
||||||
|
}
|
||||||
|
# Mapping from fulltext index names to Kuzu node labels
|
||||||
|
INDEX_TO_LABEL_KUZU_MAPPING = {
|
||||||
|
'node_name_and_summary': 'Entity',
|
||||||
|
'community_name': 'Community',
|
||||||
|
'episode_content': 'Episodic',
|
||||||
|
'edge_name_and_fact': 'RelatesToNode_',
|
||||||
|
}
|
||||||
|
|
||||||
|
# Vector index over-fetch multiplier. When a vector index search is
|
||||||
|
# combined with WHERE filters (group_id, source_uuid, etc.), some of
|
||||||
|
# the top-k index results may be filtered out. Over-fetching by this
|
||||||
|
# factor preserves recall against the final LIMIT after filtering.
|
||||||
|
# Conservative default; tunable per-deployment by editing this constant
|
||||||
|
# or via environment-variable override at the driver level (future).
|
||||||
|
VECTOR_INDEX_CANDIDATE_MULTIPLIER = 5
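A small sketch of why the over-fetch matters, using plain Python lists in place of an index. The candidate tuples, the `group_id` filter, and both helper names are assumptions made for illustration; the patched driver applies the same idea inside Cypher.

```python
# Illustrative only: simulate an index lookup followed by a post-filter.
VECTOR_INDEX_CANDIDATE_MULTIPLIER = 5


def overfetch_k(limit, multiplier=VECTOR_INDEX_CANDIDATE_MULTIPLIER):
    # Ask the index for more rows than the caller ultimately wants.
    return limit * multiplier


def filter_after_index(candidates, allowed_group_ids, limit):
    # candidates: (uuid, group_id, score) tuples, already ordered by score.
    kept = [c for c in candidates if c[1] in allowed_group_ids]
    return kept[:limit]


limit = 2
# Every third hit belongs to the group we care about.
index_hits = [
    (f'n{i}', 'aaron' if i % 3 == 0 else 'other', 1.0 - i * 0.05)
    for i in range(overfetch_k(limit))
]

without_overfetch = filter_after_index(index_hits[:limit], {'aaron'}, limit)
with_overfetch = filter_after_index(index_hits, {'aaron'}, limit)
```

Here the plain top-2 lookup keeps only one row after filtering, while the over-fetched lookup fills both slots.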
|
||||||
|
|
||||||
|
|
||||||
|
def get_range_indices(provider: GraphProvider) -> list[LiteralString]:
|
||||||
|
if provider == GraphProvider.FALKORDB:
|
||||||
|
return [
|
||||||
|
# Entity node
|
||||||
|
'CREATE INDEX FOR (n:Entity) ON (n.uuid, n.group_id, n.name, n.created_at)',
|
||||||
|
# Episodic node
|
||||||
|
'CREATE INDEX FOR (n:Episodic) ON (n.uuid, n.group_id, n.created_at, n.valid_at)',
|
||||||
|
# Community node
|
||||||
|
'CREATE INDEX FOR (n:Community) ON (n.uuid)',
|
||||||
|
# Saga node
|
||||||
|
'CREATE INDEX FOR (n:Saga) ON (n.uuid, n.group_id, n.name)',
|
||||||
|
# RELATES_TO edge
|
||||||
|
'CREATE INDEX FOR ()-[e:RELATES_TO]-() ON (e.uuid, e.group_id, e.name, e.created_at, e.expired_at, e.valid_at, e.invalid_at)',
|
||||||
|
# MENTIONS edge
|
||||||
|
'CREATE INDEX FOR ()-[e:MENTIONS]-() ON (e.uuid, e.group_id)',
|
||||||
|
# HAS_MEMBER edge
|
||||||
|
'CREATE INDEX FOR ()-[e:HAS_MEMBER]-() ON (e.uuid)',
|
||||||
|
# HAS_EPISODE edge
|
||||||
|
'CREATE INDEX FOR ()-[e:HAS_EPISODE]-() ON (e.uuid, e.group_id)',
|
||||||
|
# NEXT_EPISODE edge
|
||||||
|
'CREATE INDEX FOR ()-[e:NEXT_EPISODE]-() ON (e.uuid, e.group_id)',
|
||||||
|
]
|
||||||
|
|
||||||
|
if provider == GraphProvider.KUZU:
|
||||||
|
return []
|
||||||
|
|
||||||
|
return [
|
||||||
|
'CREATE INDEX entity_uuid IF NOT EXISTS FOR (n:Entity) ON (n.uuid)',
|
||||||
|
'CREATE INDEX episode_uuid IF NOT EXISTS FOR (n:Episodic) ON (n.uuid)',
|
||||||
|
'CREATE INDEX community_uuid IF NOT EXISTS FOR (n:Community) ON (n.uuid)',
|
||||||
|
'CREATE INDEX saga_uuid IF NOT EXISTS FOR (n:Saga) ON (n.uuid)',
|
||||||
|
'CREATE INDEX relation_uuid IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.uuid)',
|
||||||
|
'CREATE INDEX mention_uuid IF NOT EXISTS FOR ()-[e:MENTIONS]-() ON (e.uuid)',
|
||||||
|
'CREATE INDEX has_member_uuid IF NOT EXISTS FOR ()-[e:HAS_MEMBER]-() ON (e.uuid)',
|
||||||
|
'CREATE INDEX has_episode_uuid IF NOT EXISTS FOR ()-[e:HAS_EPISODE]-() ON (e.uuid)',
|
||||||
|
'CREATE INDEX next_episode_uuid IF NOT EXISTS FOR ()-[e:NEXT_EPISODE]-() ON (e.uuid)',
|
||||||
|
'CREATE INDEX entity_group_id IF NOT EXISTS FOR (n:Entity) ON (n.group_id)',
|
||||||
|
'CREATE INDEX episode_group_id IF NOT EXISTS FOR (n:Episodic) ON (n.group_id)',
|
||||||
|
'CREATE INDEX community_group_id IF NOT EXISTS FOR (n:Community) ON (n.group_id)',
|
||||||
|
'CREATE INDEX saga_group_id IF NOT EXISTS FOR (n:Saga) ON (n.group_id)',
|
||||||
|
'CREATE INDEX relation_group_id IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.group_id)',
|
||||||
|
'CREATE INDEX mention_group_id IF NOT EXISTS FOR ()-[e:MENTIONS]-() ON (e.group_id)',
|
||||||
|
'CREATE INDEX has_episode_group_id IF NOT EXISTS FOR ()-[e:HAS_EPISODE]-() ON (e.group_id)',
|
||||||
|
'CREATE INDEX next_episode_group_id IF NOT EXISTS FOR ()-[e:NEXT_EPISODE]-() ON (e.group_id)',
|
||||||
|
'CREATE INDEX name_entity_index IF NOT EXISTS FOR (n:Entity) ON (n.name)',
|
||||||
|
'CREATE INDEX saga_name IF NOT EXISTS FOR (n:Saga) ON (n.name)',
|
||||||
|
'CREATE INDEX created_at_entity_index IF NOT EXISTS FOR (n:Entity) ON (n.created_at)',
|
||||||
|
'CREATE INDEX created_at_episodic_index IF NOT EXISTS FOR (n:Episodic) ON (n.created_at)',
|
||||||
|
'CREATE INDEX valid_at_episodic_index IF NOT EXISTS FOR (n:Episodic) ON (n.valid_at)',
|
||||||
|
'CREATE INDEX name_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.name)',
|
||||||
|
'CREATE INDEX created_at_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.created_at)',
|
||||||
|
'CREATE INDEX expired_at_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.expired_at)',
|
||||||
|
'CREATE INDEX valid_at_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.valid_at)',
|
||||||
|
'CREATE INDEX invalid_at_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.invalid_at)',
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def get_fulltext_indices(provider: GraphProvider) -> list[LiteralString]:
|
||||||
|
if provider == GraphProvider.FALKORDB:
|
||||||
|
from typing import cast
|
||||||
|
|
||||||
|
from graphiti_core.driver.falkordb import STOPWORDS
|
||||||
|
|
||||||
|
# Convert to string representation for embedding in queries
|
||||||
|
stopwords_str = str(STOPWORDS)
|
||||||
|
|
||||||
|
# Use cast() to satisfy the LiteralString requirement while keeping STOPWORDS as the single source of truth
|
||||||
|
return cast(
|
||||||
|
list[LiteralString],
|
||||||
|
[
|
||||||
|
f"""CALL db.idx.fulltext.createNodeIndex(
|
||||||
|
{{
|
||||||
|
label: 'Episodic',
|
||||||
|
stopwords: {stopwords_str}
|
||||||
|
}},
|
||||||
|
'content', 'source', 'source_description', 'group_id'
|
||||||
|
)""",
|
||||||
|
f"""CALL db.idx.fulltext.createNodeIndex(
|
||||||
|
{{
|
||||||
|
label: 'Entity',
|
||||||
|
stopwords: {stopwords_str}
|
||||||
|
}},
|
||||||
|
'name', 'summary', 'group_id'
|
||||||
|
)""",
|
||||||
|
f"""CALL db.idx.fulltext.createNodeIndex(
|
||||||
|
{{
|
||||||
|
label: 'Community',
|
||||||
|
stopwords: {stopwords_str}
|
||||||
|
}},
|
||||||
|
'name', 'group_id'
|
||||||
|
)""",
|
||||||
|
"""CREATE FULLTEXT INDEX FOR ()-[e:RELATES_TO]-() ON (e.name, e.fact, e.group_id)""",
|
||||||
|
],
|
||||||
|
)
|
||||||
|
|
||||||
|
if provider == GraphProvider.KUZU:
|
||||||
|
return [
|
||||||
|
"CALL CREATE_FTS_INDEX('Episodic', 'episode_content', ['content', 'source', 'source_description']);",
|
||||||
|
"CALL CREATE_FTS_INDEX('Entity', 'node_name_and_summary', ['name', 'summary']);",
|
||||||
|
"CALL CREATE_FTS_INDEX('Community', 'community_name', ['name']);",
|
||||||
|
"CALL CREATE_FTS_INDEX('RelatesToNode_', 'edge_name_and_fact', ['name', 'fact']);",
|
||||||
|
]
|
||||||
|
|
||||||
|
return [
|
||||||
|
"""CREATE FULLTEXT INDEX episode_content IF NOT EXISTS
|
||||||
|
FOR (e:Episodic) ON EACH [e.content, e.source, e.source_description, e.group_id]""",
|
||||||
|
"""CREATE FULLTEXT INDEX node_name_and_summary IF NOT EXISTS
|
||||||
|
FOR (n:Entity) ON EACH [n.name, n.summary, n.group_id]""",
|
||||||
|
"""CREATE FULLTEXT INDEX community_name IF NOT EXISTS
|
||||||
|
FOR (n:Community) ON EACH [n.name, n.group_id]""",
|
||||||
|
"""CREATE FULLTEXT INDEX edge_name_and_fact IF NOT EXISTS
|
||||||
|
FOR ()-[e:RELATES_TO]-() ON EACH [e.name, e.fact, e.group_id]""",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def get_vector_indices(provider: GraphProvider, dimension: int = 384) -> list[LiteralString]:
|
||||||
|
"""Return CREATE VECTOR INDEX statements for the given provider.
|
||||||
|
|
||||||
|
For FalkorDB: creates HNSW vector indexes on Entity.name_embedding,
|
||||||
|
RELATES_TO.fact_embedding, and Community.name_embedding. Backed by
|
||||||
|
FalkorDB's native vector index (db.idx.vector.queryNodes /
|
||||||
|
queryRelationships).
|
||||||
|
|
||||||
|
For Neo4j and Kuzu: returns an empty list. On those backends graphiti-core
computes similarity directly in the query (Neo4j via vector.similarity.cosine,
Kuzu via array_cosine_similarity), so no pre-built vector index is required
for its usage.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
provider: The graph database provider.
|
||||||
|
dimension: Embedding dimension. Defaults to 384 (all-MiniLM-L6-v2).
|
||||||
|
Embedders with different dimensions should pass their own value
|
||||||
|
through driver configuration. graphiti-core's default embedder
|
||||||
|
is 1536 (OpenAI ada-002); BirdAI uses 384 (sentence-transformers).
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of CREATE VECTOR INDEX statements. Idempotent at FalkorDB level
|
||||||
|
if the index already exists with matching options.
|
||||||
|
"""
|
||||||
|
if provider == GraphProvider.FALKORDB:
|
||||||
|
from typing import cast
|
||||||
|
return cast(
|
||||||
|
list[LiteralString],
|
||||||
|
[
|
||||||
|
f"CREATE VECTOR INDEX FOR (n:Entity) ON (n.name_embedding) "
|
||||||
|
f"OPTIONS {{dimension: {dimension}, similarityFunction: 'cosine'}}",
|
||||||
|
f"CREATE VECTOR INDEX FOR ()-[e:RELATES_TO]-() ON (e.fact_embedding) "
|
||||||
|
f"OPTIONS {{dimension: {dimension}, similarityFunction: 'cosine'}}",
|
||||||
|
f"CREATE VECTOR INDEX FOR (n:Community) ON (n.name_embedding) "
|
||||||
|
f"OPTIONS {{dimension: {dimension}, similarityFunction: 'cosine'}}",
|
||||||
|
],
|
||||||
|
)
|
||||||
|
|
||||||
|
return []
|
||||||
|
|
||||||
|
|
||||||
|
def get_nodes_query(name: str, query: str, limit: int, provider: GraphProvider) -> str:
|
||||||
|
if provider == GraphProvider.FALKORDB:
|
||||||
|
label = NEO4J_TO_FALKORDB_MAPPING[name]
|
||||||
|
return f"CALL db.idx.fulltext.queryNodes('{label}', {query})"
|
||||||
|
|
||||||
|
if provider == GraphProvider.KUZU:
|
||||||
|
label = INDEX_TO_LABEL_KUZU_MAPPING[name]
|
||||||
|
return f"CALL QUERY_FTS_INDEX('{label}', '{name}', {query}, TOP := $limit)"
|
||||||
|
|
||||||
|
return f'CALL db.index.fulltext.queryNodes("{name}", {query}, {{limit: $limit}})'
|
||||||
|
|
||||||
|
|
||||||
|
def get_vector_cosine_func_query(vec1, vec2, provider: GraphProvider) -> str:
|
||||||
|
"""Return a Cypher fragment for cosine similarity score in [0, 1].
|
||||||
|
|
||||||
|
PRESERVED for backward compatibility and as fallback when vector indexes
|
||||||
|
do not yet exist on the FalkorDB backend. New code paths should prefer
|
||||||
|
get_vector_search_query() which uses the native vector index when
|
||||||
|
available.
|
||||||
|
"""
|
||||||
|
if provider == GraphProvider.FALKORDB:
|
||||||
|
# FalkorDB exposes cosine distance rather than similarity; rescale it to a [0, 1] score to match Neo4j's normalized cosine similarity
|
||||||
|
return f'(2 - vec.cosineDistance({vec1}, vecf32({vec2})))/2'
|
||||||
|
|
||||||
|
if provider == GraphProvider.KUZU:
|
||||||
|
return f'array_cosine_similarity({vec1}, {vec2})'
|
||||||
|
|
||||||
|
return f'vector.similarity.cosine({vec1}, {vec2})'
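The FalkorDB branch's `(2 - distance) / 2` rescaling can be checked with a few lines of plain Python. This sketch assumes cosine distance is defined as `1 - cosine similarity` (range 0 to 2); the helper names are illustrative and do not call FalkorDB.

```python
import math


def cosine_similarity(a, b):
    # Standard cosine similarity in [-1, 1].
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalized_score(a, b):
    # Assumed definition: cosine distance = 1 - similarity, so it lies in [0, 2].
    distance = 1.0 - cosine_similarity(a, b)
    # Same arithmetic as the FalkorDB fragment: map [0, 2] onto [1, 0].
    return (2.0 - distance) / 2.0


same = normalized_score([1.0, 0.0], [1.0, 0.0])
orthogonal = normalized_score([1.0, 0.0], [0.0, 1.0])
opposite = normalized_score([1.0, 0.0], [-1.0, 0.0])
```

Under that assumption the score equals `(1 + similarity) / 2`, so identical vectors score 1, orthogonal vectors 0.5, and opposite vectors 0.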
|
||||||
|
|
||||||
|
|
||||||
|
def get_relationships_query(name: str, limit: int, provider: GraphProvider) -> str:
|
||||||
|
if provider == GraphProvider.FALKORDB:
|
||||||
|
label = NEO4J_TO_FALKORDB_MAPPING[name]
|
||||||
|
return f"CALL db.idx.fulltext.queryRelationships('{label}', $query)"
|
||||||
|
|
||||||
|
if provider == GraphProvider.KUZU:
|
||||||
|
label = INDEX_TO_LABEL_KUZU_MAPPING[name]
|
||||||
|
return f"CALL QUERY_FTS_INDEX('{label}', '{name}', cast($query AS STRING), TOP := $limit)"
|
||||||
|
|
||||||
|
return f'CALL db.index.fulltext.queryRelationships("{name}", $query, {{limit: $limit}})'
|
||||||
+602
-78
@@ -1,12 +1,14 @@
|
|||||||
import os
|
import os
|
||||||
|
import re
|
||||||
import json
|
import json
|
||||||
import sqlite3
|
import sqlite3
|
||||||
import subprocess
|
import subprocess
|
||||||
import hashlib
|
import hashlib
|
||||||
|
import requests
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from datetime import datetime
|
from datetime import datetime, timedelta
|
||||||
from dotenv import load_dotenv
|
from dotenv import load_dotenv
|
||||||
from sentence_transformers import SentenceTransformer
|
from sentence_transformers import SentenceTransformer, CrossEncoder
|
||||||
import anthropic
|
import anthropic
|
||||||
from fastapi import FastAPI, Request, Response, Depends, HTTPException, BackgroundTasks
|
from fastapi import FastAPI, Request, Response, Depends, HTTPException, BackgroundTasks
|
||||||
import psycopg2
|
import psycopg2
|
||||||
@@ -91,6 +93,7 @@ if HAS_WHISPER:
|
|||||||
except Exception as e:
|
except Exception as e:
|
||||||
print(f"Whisper not available: {e}")
|
print(f"Whisper not available: {e}")
|
||||||
embedder = SentenceTransformer("all-MiniLM-L6-v2")
|
embedder = SentenceTransformer("all-MiniLM-L6-v2")
|
||||||
|
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
|
||||||
# ChromaDB removed — using pgvector
|
# ChromaDB removed — using pgvector
|
||||||
anthropic_client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
|
anthropic_client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
|
||||||
|
|
||||||
@@ -121,17 +124,59 @@ economical, specific, never performative. When answering questions,
|
|||||||
cite sources and acknowledge uncertainty rather than filling gaps with
|
cite sources and acknowledge uncertainty rather than filling gaps with
|
||||||
plausible-sounding content.
|
plausible-sounding content.
|
||||||
|
|
||||||
You have access to his complete document corpus, conversation history,
|
You have a persistent memory file (always present below) that carries
|
||||||
and a persistent memory file that carries his current context. Treat
|
Aaron's current context — treat it as ground truth for his present
|
||||||
the memory file as ground truth for his present situation. Use web
|
situation.
|
||||||
search automatically when current information is needed. Never
|
|
||||||
re-brief on context that's already in memory or documents.
|
For anything beyond what's in memory, you have a retrieve_documents
|
||||||
|
tool that searches his full knowledge base: personal documents,
|
||||||
|
reading library, conversation transcripts, and journal entries. Call
|
||||||
|
it whenever you need concrete information — names, dates, project
|
||||||
|
specifics, prior thinking, exhibition records, syllabi, anything you
|
||||||
|
don't already know. For compound questions, call it multiple times
|
||||||
|
with different concrete queries; one call per distinct information
|
||||||
|
need. Prefer specific tokens (named entities, project names, course
|
||||||
|
codes) over abstract instructional phrasing — search "FWN3D
|
||||||
|
consulting" not "my work." Results are unfiltered and ranked by
|
||||||
|
semantic similarity; judge each chunk for relevance and ignore
|
||||||
|
irrelevant hits rather than forcing them into the answer.
|
||||||
|
|
||||||
|
You also have a search_facts tool that queries a knowledge graph of
|
||||||
|
atomic facts about Aaron's entities and their relationships. The graph
|
||||||
|
was populated through early May 2026 and is not currently being
|
||||||
|
updated; treat it as a *historical* layer that holds biographical
|
||||||
|
content (career, projects, consulting), exhibition records, key
|
||||||
|
people, dossier-era claims, and time-stamped facts with explicit
|
||||||
|
validity windows. For biographical or relational questions ("write
|
||||||
|
me a bio", "what's the FWN3D / HVAMC relationship", "who did I
|
||||||
|
consult for at IBM"), call search_facts *in addition to*
|
||||||
|
retrieve_documents — the two return complementary shapes (atomic
|
||||||
|
facts vs. document passages). For current-state questions, the
|
||||||
|
persistent memory file is more authoritative than the graph.
|
||||||
|
|
||||||
|
When Aaron asks for a document file — bio, cover letter, statement,
|
||||||
|
CV section, anything he wants to send or edit outside chat — produce
|
||||||
|
the full text as your chat reply first. NEVER call save_document on
|
||||||
|
the same turn as the initial request, even when Aaron's phrasing
|
||||||
|
includes words like "save", "output", "write", or "as docx/pdf" in
|
||||||
|
the original ask. Those are part of the topic, not a save approval.
|
||||||
|
The first call to save_document only happens in a *later* turn,
|
||||||
|
after Aaron has read the draft and explicitly approves it — examples:
|
||||||
|
"save it", "yes save it", "looks good, write it out", "go ahead".
|
||||||
|
If Aaron asks for revisions, iterate in chat without calling
|
||||||
|
save_document. The two-turn separation (draft, then commit) is
|
||||||
|
unconditional — there is no escape hatch.
|
||||||
|
|
||||||
|
Use web search automatically when current external information is
|
||||||
|
needed. Never re-brief on context that's already in memory or
|
||||||
|
retrieved chunks.
|
||||||
|
|
||||||
When making factual claims about Aaron — his history, credentials, locations, dates, relationships, projects, or any specific event — you must ground the claim in a specific retrieved document or the memory file. Cite the source by name inline. If no source supports the claim, say so explicitly rather than filling the gap with plausible-sounding content. Do not confabulate. If you are inferring rather than citing, mark it as inference."""
|
When making factual claims about Aaron — his history, credentials, locations, dates, relationships, projects, or any specific event — you must ground the claim in a specific retrieved document or the memory file. Cite the source by name inline. If no source supports the claim, say so explicitly rather than filling the gap with plausible-sounding content. Do not confabulate. If you are inferring rather than citing, mark it as inference."""
|
||||||
|
|
||||||
# Auth configuration
|
# Auth configuration
|
||||||
import os
|
import os
|
||||||
SESSION_PASSWORD = os.getenv("AARON_AI_PASSWORD", "changeme")
|
SESSION_PASSWORD = os.getenv("AARON_AI_PASSWORD", "changeme")
|
||||||
|
SESSION_MAX_AGE_SECONDS = 60 * 60 * 24 * 365
|
||||||
SESSIONS_DB = str(Path.home() / "aaronai" / "sessions.db")
|
SESSIONS_DB = str(Path.home() / "aaronai" / "sessions.db")
|
||||||
|
|
||||||
def _init_sessions():
|
def _init_sessions():
|
||||||
@@ -163,7 +208,10 @@ def delete_session(token: str):
|
|||||||
|
|
||||||
def session_exists(token: str) -> bool:
|
def session_exists(token: str) -> bool:
|
||||||
conn = _connect_sessions()
|
conn = _connect_sessions()
|
||||||
row = conn.execute("SELECT 1 FROM sessions WHERE token = ?", (token,)).fetchone()
|
cutoff = (datetime.now() - timedelta(seconds=SESSION_MAX_AGE_SECONDS)).isoformat()
|
||||||
|
conn.execute("DELETE FROM sessions WHERE created_at < ?", (cutoff,))
|
||||||
|
conn.commit()
|
||||||
|
row = conn.execute("SELECT 1 FROM sessions WHERE token = ? AND created_at >= ?", (token, cutoff)).fetchone()
|
||||||
conn.close()
|
conn.close()
|
||||||
return row is not None
|
return row is not None
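The expiry check above can be exercised against an in-memory database. This is a sketch under assumptions: the `sessions` table is taken to have `token` and `created_at` columns, with `created_at` stored as an ISO 8601 string, and `session_is_live` is a hypothetical name that takes `now` as a parameter so the run is deterministic.

```python
import sqlite3
from datetime import datetime, timedelta

MAX_AGE_SECONDS = 60 * 60 * 24 * 365  # one year, as in SESSION_MAX_AGE_SECONDS


def session_is_live(conn, token, now):
    # ISO 8601 strings of equal precision sort chronologically, so a plain
    # string comparison against the cutoff is enough.
    cutoff = (now - timedelta(seconds=MAX_AGE_SECONDS)).isoformat(timespec='seconds')
    conn.execute('DELETE FROM sessions WHERE created_at < ?', (cutoff,))
    conn.commit()
    row = conn.execute(
        'SELECT 1 FROM sessions WHERE token = ? AND created_at >= ?', (token, cutoff)
    ).fetchone()
    return row is not None


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sessions (token TEXT PRIMARY KEY, created_at TEXT)')
now = datetime(2026, 5, 4, 12, 0, 0)
rows = [
    ('fresh', now.isoformat(timespec='seconds')),
    ('stale', (now - timedelta(days=400)).isoformat(timespec='seconds')),
]
conn.executemany('INSERT INTO sessions VALUES (?, ?)', rows)
conn.commit()

fresh_ok = session_is_live(conn, 'fresh', now)
stale_ok = session_is_live(conn, 'stale', now)
remaining = conn.execute('SELECT COUNT(*) FROM sessions').fetchone()[0]
```

The sketch pins the timestamp precision with `timespec='seconds'`; mixing timestamps with and without fractional seconds could make string comparison disagree with chronological order near the cutoff.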
|
||||||
|
|
||||||
@@ -239,30 +287,127 @@ def remove_from_memory(item):
|
|||||||
save_memory("\n".join(filtered))
|
save_memory("\n".join(filtered))
|
||||||
return len(lines) - len(filtered)
|
return len(lines) - len(filtered)
|
||||||
|
|
||||||
def retrieve_context(query, n_results=8):
|
HYBRID_CANDIDATES = 30
|
||||||
"""Pure semantic retrieval over pgvector. Top-N by cosine similarity, threshold 0.3.
|
RRF_K = 60
|
||||||
No CV pinning, no keyword routing — see architecture doc substrate-dependency section.
|
FINAL_LIMIT = 8
|
||||||
Substrate-level workarounds (entity-keyed routing, hybrid retrieval) live at the
|
MAX_RETRIEVALS_PER_TURN = 5
|
||||||
Graphiti layer, not as wrapper logic above pgvector."""
|
MAX_CITED_SOURCES = 5
|
||||||
|
|
||||||
|
_TSQUERY_SANITIZE_RE = re.compile(r"[^\w\s\"'-]")
|
||||||
|
|
||||||
|
|
||||||
|
def _websearch_query(text: str) -> str:
|
||||||
|
"""Strip characters websearch_to_tsquery doesn't handle cleanly. Quoted
|
||||||
|
phrases and 'or' are preserved by the function itself."""
|
||||||
|
return _TSQUERY_SANITIZE_RE.sub(" ", text).strip()
|
||||||
|
|
||||||
|
|
||||||
|
def _rerank(query: str, candidates: list[tuple]) -> list[tuple]:
|
||||||
|
"""Cross-encoder rerank. Candidates are (id, document, source, folder, created_at)
|
||||||
|
tuples. Returns the same tuples reordered by reranker score with created_at as
|
||||||
|
secondary key, so when two chunks tie on score the newer one wins, which
|
||||||
|
keeps memory/journal files biased toward the latest snapshot."""
|
||||||
|
if not candidates:
|
||||||
|
return []
|
||||||
|
pairs = [(query, row[1]) for row in candidates]
|
||||||
|
scores = reranker.predict(pairs)
|
||||||
|
return [row for row, _ in sorted(
|
||||||
|
zip(candidates, scores),
|
||||||
|
key=lambda x: (float(x[1]), x[0][4] or ""),
|
||||||
|
reverse=True,
|
||||||
|
)]
|
||||||
|
|
||||||
|
|
||||||
|
def _format_source(source: str, folder: str) -> str:
|
||||||
|
"""Surface folder context to the LLM so it can disambiguate same-named files
|
||||||
|
(e.g., 21 different CV.docx files across job-application folders)."""
|
||||||
|
source = source or "unknown"
|
||||||
|
if folder and folder not in ("", "."):
|
||||||
|
return f"{folder}/{source}"
|
||||||
|
return source
|
||||||
|
|
||||||
|
|
||||||
|
def _dedup_key(doc: str) -> str:
|
||||||
|
"""Collapse near-duplicates by content. Files copied to multiple folders
|
||||||
|
produce byte-identical chunks; hashing the first 300 lowercased characters catches those without affecting
|
||||||
|
legitimately-different chunks of the same source (e.g., separate sections
|
||||||
|
of a conversation)."""
|
||||||
|
return hashlib.md5(doc[:300].lower().encode("utf-8", "ignore")).hexdigest()
|
||||||
|
|
||||||
|
|
||||||
|
def retrieve_context(query, n_results=FINAL_LIMIT):
|
||||||
|
"""Hybrid retrieval (dense + lexical, RRF fused) followed by cross-encoder rerank.
|
||||||
|
|
||||||
|
- Dense (pgvector) handles paraphrase / semantic similarity.
|
||||||
|
- Lexical (tsvector) catches rare named tokens (FWN3D, Sono-Tek, course codes)
|
||||||
|
the embedding model has no signal for.
|
||||||
|
- RRF combines the two rankings without calibrating score scales.
|
||||||
|
- Cross-encoder rerank scores each (query, chunk) pair jointly.
|
||||||
|
- Near-duplicate collapse on output so top-N slots aren't burned by
|
||||||
|
multi-folder copies of the same file.
|
||||||
|
|
||||||
|
No type or folder filtering: imposing a taxonomy at retrieval time is a
|
||||||
|
heuristic we've explicitly rejected. The reranker ranks, the caller (LLM)
|
||||||
|
decides what's relevant to its task."""
|
||||||
query_embedding = embedder.encode([query]).tolist()[0]
|
query_embedding = embedder.encode([query]).tolist()[0]
|
||||||
|
ts_query = _websearch_query(query)
|
||||||
|
|
||||||
context_pieces = []
|
context_pieces = []
|
||||||
sources = []
|
sources = []
|
||||||
|
|
||||||
try:
|
try:
|
||||||
pg = get_pg()
|
pg = get_pg()
|
||||||
cur = pg.cursor()
|
cur = pg.cursor()
|
||||||
|
|
||||||
cur.execute("""
|
cur.execute("""
|
||||||
SELECT document, source, 1 - (embedding <=> %s::vector) as similarity
|
SELECT id, document, source, metadata->>'folder' AS folder, created_at
|
||||||
FROM embeddings
|
FROM embeddings
|
||||||
ORDER BY embedding <=> %s::vector
|
ORDER BY embedding <=> %s::vector
|
||||||
LIMIT %s
|
LIMIT %s
|
||||||
""", (query_embedding, query_embedding, n_results))
|
""", (query_embedding, HYBRID_CANDIDATES))
|
||||||
for doc, source, similarity in cur.fetchall():
|
dense_hits = cur.fetchall()
|
||||||
if similarity > 0.3:
|
|
||||||
context_pieces.append(doc)
|
lexical_hits = []
|
||||||
sources.append(source or "unknown")
|
if ts_query:
|
||||||
|
cur.execute("""
|
||||||
|
SELECT id, document, source, metadata->>'folder' AS folder, created_at
|
||||||
|
FROM embeddings
|
||||||
|
WHERE to_tsvector('english', document)
|
||||||
|
@@ websearch_to_tsquery('english', %s)
|
||||||
|
ORDER BY ts_rank(to_tsvector('english', document),
|
||||||
|
websearch_to_tsquery('english', %s)) DESC
|
||||||
|
LIMIT %s
|
||||||
|
""", (ts_query, ts_query, HYBRID_CANDIDATES))
|
||||||
|
lexical_hits = cur.fetchall()
|
||||||
|
|
||||||
pg.close()
|
pg.close()
|
||||||
|
|
||||||
|
scores = {}
|
||||||
|
rows_by_id = {}
|
||||||
|
for rank, row in enumerate(dense_hits):
|
||||||
|
scores[row[0]] = scores.get(row[0], 0) + 1.0 / (RRF_K + rank + 1)
|
||||||
|
rows_by_id[row[0]] = row
|
||||||
|
for rank, row in enumerate(lexical_hits):
|
||||||
|
scores[row[0]] = scores.get(row[0], 0) + 1.0 / (RRF_K + rank + 1)
|
||||||
|
rows_by_id[row[0]] = row
|
||||||
|
|
||||||
|
rrf_ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
|
||||||
|
candidates = [rows_by_id[doc_id] for doc_id, _ in rrf_ranked]
|
||||||
|
|
||||||
|
seen = set()
|
||||||
|
for _id, doc, source, folder, _created_at in _rerank(query, candidates):
|
||||||
|
key = _dedup_key(doc)
|
||||||
|
if key in seen:
|
||||||
|
continue
|
||||||
|
seen.add(key)
|
||||||
|
context_pieces.append(doc)
|
||||||
|
sources.append(_format_source(source, folder))
|
||||||
|
if len(context_pieces) >= n_results:
|
||||||
|
break
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
print(f"pgvector retrieval error: {e}")
|
print(f"hybrid retrieval error: {e}")
|
||||||
|
|
||||||
return context_pieces, sources
|
return context_pieces, sources
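The reciprocal rank fusion step inside `retrieve_context` can be isolated as a few lines. This sketch uses bare string ids instead of database rows, and `rrf_fuse` is a hypothetical helper name; the scoring formula `1 / (k + rank + 1)` with `k = 60` is the one used above.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked id lists with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # rank is 0-based, so the top hit contributes 1 / (k + 1).
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


dense = ['a', 'b', 'c']    # ids ordered by vector distance
lexical = ['c', 'a', 'd']  # ids ordered by ts_rank
fused = rrf_fuse([dense, lexical])
print(fused)
```

Ids that appear in both lists accumulate two contributions, which is why `a` and `c` outrank `b` even though `b` sits second in the dense list. Only ranks are used, so the two score scales never need to be calibrated against each other.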
|
||||||
|
|
||||||
def get_conversation_history(conversation_id, limit=20):
|
def get_conversation_history(conversation_id, limit=20):
|
||||||
@@ -300,50 +445,370 @@ def create_conversation(title="New conversation"):
|
|||||||
conn.close()
|
conn.close()
|
||||||
return conv_id
|
return conv_id
|
||||||
|
|
||||||
|
NEXTCLOUD_URL = os.getenv("NEXTCLOUD_URL", "https://nextcloud.aaronnelson.studio")
|
||||||
|
NEXTCLOUD_USER = os.getenv("NEXTCLOUD_USER", "aaron")
|
||||||
|
NEXTCLOUD_PASSWORD = os.getenv("NEXTCLOUD_PASSWORD", "")
|
||||||
|
DRAFTS_WEBDAV = f"{NEXTCLOUD_URL}/remote.php/dav/files/{NEXTCLOUD_USER}/Drafts"
|
||||||
|
|
||||||
|
_FILENAME_SAFE_RE = re.compile(r"[^A-Za-z0-9_\-\. ]")
|
||||||
|
|
||||||
|
|
||||||
|
GRAPHITI_URL = os.getenv("GRAPHITI_URL", "http://localhost:8001")
|
||||||
|
GRAPHITI_GROUP_ID = os.getenv("GRAPHITI_GROUP_ID", "aaron")
|
||||||
|
|
||||||
|
|
||||||
|
SEARCH_FACTS_TOOL = {
|
||||||
|
"name": "search_facts",
|
||||||
|
"description": (
|
||||||
|
"Search Aaron's knowledge graph for atomic facts about entities and "
|
||||||
|
"their relationships. The graph holds time-stamped facts captured up "
|
||||||
|
"to early May 2026 — biographical content (career, projects, "
|
||||||
|
"consulting), exhibition history, key relationships, dossier-era "
|
||||||
|
"claims. Returns short sentence-shaped facts with valid_at / "
|
||||||
|
"invalid_at timestamps so you can distinguish current state from "
|
||||||
|
"superseded history. Useful for: bios, 'who did I consult for', "
|
||||||
|
"'what's the relationship between X and Y', any question shaped like "
|
||||||
|
"a relational lookup. Complements retrieve_documents (which returns "
|
||||||
|
"longer chunk passages). Call this *in addition to* retrieve_documents "
|
||||||
|
"for biographical or relational questions — the two return "
|
||||||
|
"different shapes of evidence. The graph hasn't been updated since "
|
||||||
|
"early May 2026; for current-state questions, the persistent memory "
|
||||||
|
"file or recent documents are more authoritative."
|
||||||
|
),
|
||||||
|
"input_schema": {
|
||||||
|
"type": "object",
|
||||||
|
"properties": {
|
||||||
|
"query": {
|
||||||
|
"type": "string",
|
||||||
|
"description": "The fact-shaped query. Concrete entity names work best.",
|
||||||
|
},
|
||||||
|
},
|
||||||
|
"required": ["query"],
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _push_chat_turn_to_graphiti(conversation_id, user_message, assistant_message):
    """Async fire-and-forget push of a chat turn into Graphiti. Single episode,
    default extraction, no custom_extraction_instructions. Takes ~20 min in
    the background against the current ~4,300-entity graph; the chat caller
    is not gated on this. Errors are logged, never raised."""
    if os.getenv("SKIP_GRAPHITI_CHAT_PUSH"):
        return
    if not (user_message or "").strip() and not (assistant_message or "").strip():
        return
    import threading
    from datetime import datetime as _dt

    def _work():
        try:
            episode_name = f"chat-{conversation_id[:8]}-{_dt.now().strftime('%Y%m%dT%H%M%S')}"
            content = (
                f"User: {user_message}\n\n"
                f"Assistant: {assistant_message}"
            )
            payload = {
                "name": episode_name,
                "content": content,
                "source_description": f"chat turn (conversation {conversation_id})",
                "timestamp": _dt.now().isoformat(),
                "group_id": GRAPHITI_GROUP_ID,
            }
            # Long timeout: sidecar add_episode against the current graph
            # is empirically ~20 min wall-clock. We're patient; chat isn't.
            r = requests.post(f"{GRAPHITI_URL}/episodes", json=payload, timeout=1800)
            if r.status_code == 200:
                print(f"[graphiti-push] turn ingested: {episode_name}", flush=True)
            else:
                print(f"[graphiti-push] non-200 ({r.status_code}) for {episode_name}: {r.text[:200]}", flush=True)
        except requests.RequestException as e:
            print(f"[graphiti-push] request failed: {e}", flush=True)
        except Exception as e:
            print(f"[graphiti-push] unexpected error: {e}", flush=True)

    threading.Thread(target=_work, daemon=True).start()


def _execute_search_facts(tool_input):
    """Hit Graphiti /search, format the results as text for Claude."""
    query = (tool_input or {}).get("query", "").strip()
    if not query:
        return "No query provided."
    try:
        r = requests.get(
            f"{GRAPHITI_URL}/search",
            params={"query": query, "limit": 8, "group_id": GRAPHITI_GROUP_ID},
            timeout=15,
        )
    except requests.RequestException as e:
        return f"search_facts: Graphiti unreachable ({e})."
    if r.status_code != 200:
        return f"search_facts: Graphiti returned {r.status_code}."
    try:
        results = r.json().get("results", [])
    except ValueError:
        # A 200 with a non-JSON body should not crash the tool loop.
        return "search_facts: Graphiti returned a non-JSON response."
    if not results:
        return f"No facts found for {query!r}."
    lines = []
    for i, f in enumerate(results, 1):
        # `or ""` also covers an explicit null, matching how valid_at is handled.
        fact = (f.get("fact") or "").strip()
        valid_at = f.get("valid_at") or "?"
        invalid_at = f.get("invalid_at")
        validity = f"valid {valid_at}"
        # The sidecar may serialize a missing invalid_at as the string "None".
        if invalid_at and invalid_at != "None":
            validity += f" → superseded {invalid_at}"
        lines.append(f"[{i}] {fact} ({validity})")
    return "\n".join(lines)


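The validity formatting in `_execute_search_facts` can be exercised without a running Graphiti sidecar. The sketch below is only an illustration: `format_fact` is a hypothetical helper name, and the sample hits are made-up placeholders shaped like the `fact` / `valid_at` / `invalid_at` fields the function reads.

```python
def format_fact(index, hit):
    """Render one search hit as '[n] fact (valid X → superseded Y)'."""
    text = (hit.get("fact") or "").strip()
    valid_at = hit.get("valid_at") or "?"
    invalid_at = hit.get("invalid_at")
    validity = f"valid {valid_at}"
    # Only append the superseded marker when invalid_at carries a real value.
    if invalid_at and invalid_at != "None":
        validity += f" → superseded {invalid_at}"
    return f"[{index}] {text} ({validity})"


# Placeholder hits, not real graph content.
sample_hits = [
    {"fact": " A knows B. ", "valid_at": "2024-01-01", "invalid_at": "2025-06-01"},
    {"fact": "C uses D.", "valid_at": None, "invalid_at": "None"},
]
for n, hit in enumerate(sample_hits, 1):
    print(format_fact(n, hit))
```

A current fact prints only its `valid` date, while a superseded one carries both timestamps, which is what lets the model separate current state from history.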
SAVE_DOCUMENT_TOOL = {
|
||||||
|
"name": "save_document",
|
||||||
|
"description": (
|
||||||
|
"Render markdown content to docx or pdf and save it to Aaron's Nextcloud "
|
||||||
|
"Drafts/ folder (syncs to his other devices and web UI). Use this when "
|
||||||
|
"Aaron asks for a document file rather than chat text — bios, cover "
|
||||||
|
"letters, statements, CV sections, anything he'll edit or send. Returns "
|
||||||
|
"the saved filename. Pick a descriptive filename (no extension) like "
|
||||||
|
"'Aaron_Nelson_Bio_Utah_2026-05'. Format is 'docx' for editable drafts, "
|
||||||
|
"'pdf' for typeset/print-ready output. Content should be well-formed "
|
||||||
|
"markdown — # headings, **bold**, *italic*, - bulleted lists. Don't "
|
||||||
|
"embed file content in the chat response too; just call this tool and "
|
||||||
|
"tell Aaron where it landed."
|
||||||
|
),
|
||||||
|
"input_schema": {
|
||||||
|
"type": "object",
|
||||||
|
"properties": {
|
||||||
|
"content": {
|
||||||
|
"type": "string",
|
||||||
|
"description": "Document content in markdown.",
|
||||||
|
},
|
||||||
|
"filename": {
|
||||||
|
"type": "string",
|
||||||
|
"description": "Descriptive filename without extension.",
|
||||||
|
},
|
||||||
|
"format": {
|
||||||
|
"type": "string",
|
||||||
|
"enum": ["docx", "pdf"],
|
||||||
|
"description": "Output format.",
|
||||||
|
},
|
||||||
|
},
|
||||||
|
"required": ["content", "filename", "format"],
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _safe_filename(name: str, ext: str) -> str:
    """Strip path components and unsafe chars; force the requested extension."""
    base = Path(name).name
    base = _FILENAME_SAFE_RE.sub("_", base).strip().rstrip(".")
    if not base:
        base = "untitled"
    base = Path(base).stem
    return f"{base}.{ext}"


def _webdav_unique_url(base_url: str, filename: str, auth) -> tuple[str, str]:
    """Return a WebDAV URL that doesn't collide with an existing file. Appends
    _2, _3, ... until PROPFIND returns 404. Matches the convention dream.py uses."""
    stem = Path(filename).stem
    suffix = Path(filename).suffix
    name = filename
    i = 2
    while True:
        url = f"{base_url}/{name}"
        check = requests.request("PROPFIND", url, auth=auth, timeout=10)
        if check.status_code == 404:
            return url, name
        name = f"{stem}_{i}{suffix}"
        i += 1
        if i > 50:
            raise RuntimeError("could not find a free filename")


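The suffixing loop in `_webdav_unique_url` is easier to follow with the network call factored out. This is a minimal sketch under that assumption: `first_free_name` and its `exists` callback are hypothetical stand-ins for the PROPFIND probe, not part of the codebase.

```python
from pathlib import Path


def first_free_name(filename, exists, max_suffix=50):
    """Return filename, or filename with _2, _3, ... appended, whichever is free first."""
    stem = Path(filename).stem
    suffix = Path(filename).suffix
    name = filename
    i = 2
    while True:
        if not exists(name):  # stands in for "PROPFIND returned 404"
            return name
        name = f"{stem}_{i}{suffix}"
        i += 1
        if i > max_suffix:
            raise RuntimeError("could not find a free filename")


taken = {"Bio.docx", "Bio_2.docx"}
print(first_free_name("Bio.docx", taken.__contains__))
```

Because the cap is tested right after the counter is incremented, the highest suffix actually probed is `_49`. Any non-404 status counts as "taken", so an auth failure would also walk the loop until the cap raises.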
def _execute_save_document(tool_input):
    """Generate a document via pandoc and PUT it to Nextcloud Drafts/.
    Returns a user-facing status string for Claude to relay."""
    if not NEXTCLOUD_PASSWORD:
        return "save_document: NEXTCLOUD_PASSWORD not configured."

    payload = tool_input or {}
    content = payload.get("content", "")
    raw_filename = payload.get("filename", "untitled")
    fmt = payload.get("format", "docx")

    if not content.strip():
        return "save_document: empty content, nothing saved."
    if fmt not in ("docx", "pdf"):
        return f"save_document: unsupported format {fmt!r}; use 'docx' or 'pdf'."

    safe_name = _safe_filename(raw_filename, fmt)
    auth = (NEXTCLOUD_USER, NEXTCLOUD_PASSWORD)

    # Ensure Drafts/ exists. 201 = created, 405 = already there; both are fine.
    try:
        requests.request("MKCOL", DRAFTS_WEBDAV, auth=auth, timeout=10)
    except requests.RequestException as e:
        return f"save_document: could not reach Nextcloud ({e})."

    try:
        url, final_name = _webdav_unique_url(DRAFTS_WEBDAV, safe_name, auth)
    except (requests.RequestException, RuntimeError) as e:
        return f"save_document: filename probe failed ({e})."

    # Read markdown from stdin and write the rendered bytes to stdout ("-o -").
    cmd = ["pandoc", "-f", "markdown", "-t", fmt, "-o", "-"]
    if fmt == "pdf":
        # insert(-2, ...) places the engine flag just before the trailing "-o", "-" pair.
        cmd.insert(-2, "--pdf-engine=xelatex")
    try:
        proc = subprocess.run(
            cmd, input=content.encode("utf-8"),
            capture_output=True, timeout=120,
        )
    except subprocess.TimeoutExpired:
        return "save_document: pandoc timed out (>120s)."
    except FileNotFoundError:
        return ("save_document: pandoc binary not reachable from the api process "
                "(check that PATH in aaronai.service includes /usr/bin).")
    if proc.returncode != 0:
        err = proc.stderr.decode("utf-8", errors="replace")[:400]
        return f"save_document: pandoc failed: {err}"

    try:
        put = requests.put(url, data=proc.stdout, auth=auth, timeout=60)
    except requests.RequestException as e:
        return f"save_document: WebDAV upload failed ({e})."
    if put.status_code not in (200, 201, 204):
        return f"save_document: WebDAV upload returned {put.status_code}."

    return f"Saved to Nextcloud: Drafts/{final_name}"


RETRIEVE_DOCUMENTS_TOOL = {
|
||||||
|
"name": "retrieve_documents",
|
||||||
|
"description": (
|
||||||
|
"Search Aaron's knowledge base — personal documents, reading library, "
|
||||||
|
"conversation transcripts, and journal entries — for content relevant "
|
||||||
|
"to a query. Call whenever you need concrete information you don't "
|
||||||
|
"already have from the persistent memory file. For compound questions "
|
||||||
|
"(e.g. 'bio emphasizing consulting work and recent research'), call "
|
||||||
|
"this tool multiple times with different concrete queries; one call "
|
||||||
|
"per distinct information need. Prefer specific named entities, "
|
||||||
|
"project names, course codes, or topic-specific terms over abstract "
|
||||||
|
"instructional phrasing — 'FWN3D consulting' retrieves better than "
|
||||||
|
"'my work'. Results are ranked by semantic + lexical hybrid retrieval "
|
||||||
|
"and a cross-encoder reranker; no taxonomy is applied, so judge each "
|
||||||
|
"returned chunk on its own merits and ignore irrelevant hits."
|
||||||
|
),
|
||||||
|
"input_schema": {
|
||||||
|
"type": "object",
|
||||||
|
"properties": {
|
||||||
|
"query": {
|
||||||
|
"type": "string",
|
||||||
|
"description": "The search query. Use concrete terms.",
|
||||||
|
},
|
||||||
|
},
|
||||||
|
"required": ["query"],
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _execute_retrieve_documents(tool_input):
    """Run retrieve_context for a tool call. Returns (tool_result_text, sources)."""
    query = (tool_input or {}).get("query", "").strip()
    if not query:
        return ("No query provided.", [])
    pieces, sources = retrieve_context(query)
    if not pieces:
        return (f"No results for query={query!r}.", [])
    parts = []
    for i, (piece, src) in enumerate(zip(pieces, sources), 1):
        parts.append(f"[{i}] Source: {src}\n{piece}")
    return ("\n\n---\n\n".join(parts), sources)


||||||
def chat(user_message, conversation_id, settings, client_time=None):
|
def chat(user_message, conversation_id, settings, client_time=None):
|
||||||
memory = load_memory()
|
memory = load_memory()
|
||||||
context_pieces, sources = retrieve_context(user_message)
|
|
||||||
history = get_conversation_history(conversation_id)
|
history = get_conversation_history(conversation_id)
|
||||||
|
|
||||||
context_parts = []
|
# System prompt + persistent memory are stable across the tool_use round-trip
|
||||||
if client_time:
|
# and across turns within the 5-minute cache TTL. Putting cache_control on the
|
||||||
context_parts.append(f"Current time (user-supplied, not logged): {client_time}")
|
# last system block creates a cache breakpoint here — the second LLM call in a
|
||||||
|
# tool_use turn reads this prefix from cache (~10% of standard input cost)
|
||||||
|
# instead of re-billing it. Memory lives here (not in the user message) so its
|
||||||
|
# position stays stable for cache hits.
|
||||||
|
system_blocks = [{"type": "text", "text": SYSTEM_PROMPT}]
|
||||||
if memory:
|
if memory:
|
||||||
context_parts.append(f"Aaron's persistent memory:\n\n{memory}")
|
system_blocks.append({
|
||||||
if context_pieces:
|
"type": "text",
|
||||||
context_str = "\n\n---\n\n".join(context_pieces)
|
"text": f"Aaron's persistent memory:\n\n{memory}",
|
||||||
unique_sources = list(set(sources))
|
})
|
||||||
context_parts.append(
|
system_blocks[-1]["cache_control"] = {"type": "ephemeral"}
|
||||||
f"Relevant excerpts from Aaron's documents:\n\n{context_str}\n\nSources: {', '.join(unique_sources)}"
|
|
||||||
|
# client_time is per-turn dynamic, so it stays out of the cached prefix.
|
||||||
|
if client_time:
|
||||||
|
full_message = (
|
||||||
|
f"Current time (user-supplied, not logged): {client_time}\n\n"
|
||||||
|
f"---\n\n{user_message}"
|
||||||
)
|
)
|
||||||
context_block = "\n\n====\n\n".join(context_parts) + "\n\n---\n\n" if context_parts else ""
|
else:
|
||||||
full_message = context_block + user_message
|
full_message = user_message
|
||||||
|
|
||||||
messages = history + [{"role": "user", "content": full_message}]
|
messages = history + [{"role": "user", "content": full_message}]
|
||||||
|
|
||||||
tools = [{"type": "web_search_20250305", "name": "web_search"}] if settings.get("web_search", True) else []
|
tools = [RETRIEVE_DOCUMENTS_TOOL, SEARCH_FACTS_TOOL, SAVE_DOCUMENT_TOOL]
|
||||||
|
if settings.get("web_search", True):
|
||||||
|
tools.append({"type": "web_search_20250305", "name": "web_search"})
|
||||||
|
|
||||||
|
accumulated_sources = []
|
||||||
|
retrieval_count = 0
|
||||||
|
|
||||||
while True:
|
while True:
|
||||||
kwargs = {
|
response = anthropic_client.messages.create(
|
||||||
"model": "claude-sonnet-4-6",
|
model="claude-sonnet-4-6",
|
||||||
"max_tokens": 2048,
|
max_tokens=2048,
|
||||||
"system": SYSTEM_PROMPT,
|
system=system_blocks,
|
||||||
"messages": messages
|
messages=messages,
|
||||||
}
|
tools=tools,
|
||||||
if tools:
|
)
|
||||||
kwargs["tools"] = tools
|
|
||||||
|
|
||||||
response = anthropic_client.messages.create(**kwargs)
|
|
||||||
|
|
||||||
if response.stop_reason == "tool_use":
|
if response.stop_reason == "tool_use":
|
||||||
messages.append({"role": "assistant", "content": response.content})
|
messages.append({"role": "assistant", "content": response.content})
|
||||||
tool_results = []
|
tool_results = []
|
||||||
for block in response.content:
|
for block in response.content:
|
||||||
if block.type == "tool_use":
|
if block.type != "tool_use":
|
||||||
|
continue
|
||||||
|
if block.name == "retrieve_documents":
|
||||||
|
if retrieval_count >= MAX_RETRIEVALS_PER_TURN:
|
||||||
|
result_text = (
|
||||||
|
f"Retrieval budget exhausted "
|
||||||
|
f"({MAX_RETRIEVALS_PER_TURN} calls used this turn). "
|
||||||
|
"Answer with the information you already have or "
|
||||||
|
"tell Aaron you need a more focused question."
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
result_text, result_sources = _execute_retrieve_documents(block.input)
|
||||||
|
accumulated_sources.extend(result_sources)
|
||||||
|
retrieval_count += 1
|
||||||
tool_results.append({
|
tool_results.append({
|
||||||
"type": "tool_result",
|
"type": "tool_result",
|
||||||
"tool_use_id": block.id,
|
"tool_use_id": block.id,
|
||||||
"content": "Search completed"
|
"content": result_text,
|
||||||
|
})
|
||||||
|
elif block.name == "search_facts":
|
||||||
|
result_text = _execute_search_facts(block.input)
|
||||||
|
tool_results.append({
|
||||||
|
"type": "tool_result",
|
||||||
|
"tool_use_id": block.id,
|
||||||
|
"content": result_text,
|
||||||
|
})
|
||||||
|
elif block.name == "save_document":
|
||||||
|
result_text = _execute_save_document(block.input)
|
||||||
|
tool_results.append({
|
||||||
|
"type": "tool_result",
|
||||||
|
"tool_use_id": block.id,
|
||||||
|
"content": result_text,
|
||||||
|
})
|
||||||
|
else:
|
||||||
|
tool_results.append({
|
||||||
|
"type": "tool_result",
|
||||||
|
"tool_use_id": block.id,
|
||||||
|
"content": "Search completed",
|
||||||
})
|
})
|
||||||
messages.append({"role": "user", "content": tool_results})
|
messages.append({"role": "user", "content": tool_results})
|
||||||
else:
|
else:
|
||||||
@@ -351,7 +816,18 @@ def chat(user_message, conversation_id, settings, client_time=None):
|
|||||||
for block in response.content:
|
for block in response.content:
|
||||||
if hasattr(block, "text"):
|
if hasattr(block, "text"):
|
||||||
assistant_message += block.text
|
assistant_message += block.text
|
||||||
return assistant_message, list(set(sources))
|
# Async fire-and-forget into Graphiti so the turn lands in the
|
||||||
|
# graph as a single episode for future search_facts queries to
|
||||||
|
# find. Takes ~20 min wall-clock in the background; chat returns
|
||||||
|
# immediately. Disable via SKIP_GRAPHITI_CHAT_PUSH=1 if needed.
|
||||||
|
_push_chat_turn_to_graphiti(conversation_id, user_message, assistant_message)
|
||||||
|
# Cap citations: accumulated_sources can grow large across multiple
|
||||||
|
# retrieve_documents calls and not every chunk that came back was
|
||||||
|
# actually used in the answer. Insertion order preserves rank
|
||||||
|
# (each call returns chunks reranker-ordered, so the earliest
|
||||||
|
# entries are the highest-relevance from the most direct queries).
|
||||||
|
deduped = list(dict.fromkeys(accumulated_sources))
|
||||||
|
return assistant_message, deduped[:MAX_CITED_SOURCES]
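The citation cap above depends on `dict.fromkeys` keeping first-seen order, which the earlier `list(set(sources))` did not guarantee. A small sketch of the difference follows; `cap_sources` and the `limit` argument are illustrative names standing in for the inline code and `MAX_CITED_SOURCES`, whose value is not shown in this diff.

```python
def cap_sources(accumulated, limit):
    """Drop repeats while keeping first-seen (rank) order, then truncate."""
    return list(dict.fromkeys(accumulated))[:limit]


# Two retrieve_documents calls, each already in reranker order (placeholder names).
first_call = ["bio.md", "cv.pdf", "notes.md"]
second_call = ["cv.pdf", "talk.md"]
print(cap_sources(first_call + second_call, 3))
```

With a set, the surviving three entries could come out in any order, so a truncated list might drop the top-ranked source.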
|
||||||
|
|
||||||
from contextlib import asynccontextmanager
|
from contextlib import asynccontextmanager
|
||||||
|
|
||||||
@@ -381,7 +857,7 @@ async def login(request: Request, response: Response):
|
|||||||
httponly=True,
|
httponly=True,
|
||||||
secure=True,
|
secure=True,
|
||||||
samesite="lax",
|
samesite="lax",
|
||||||
max_age=60 * 60 * 24 * 30
|
max_age=SESSION_MAX_AGE_SECONDS
|
||||||
)
|
)
|
||||||
response.body = b'{"ok": true}'
|
response.body = b'{"ok": true}'
|
||||||
response.status_code = 200
|
response.status_code = 200
|
||||||
@@ -686,44 +1162,92 @@ async def run_dreamer(request: Request, auth: str = Depends(require_auth)):
|
|||||||
return JSONResponse({"started": False, "error": str(e)})
|
return JSONResponse({"started": False, "error": str(e)})
|
||||||
|
|
||||||
def transcribe_and_save(tmp_path, timestamp, nextcloud_url, nextcloud_user, nextcloud_password):
|
def transcribe_and_save(tmp_path, timestamp, nextcloud_url, nextcloud_user, nextcloud_password):
|
||||||
"""Background task — transcribes audio and saves to Nextcloud after endpoint returns."""
|
"""Background task — transcribes audio and saves to Nextcloud after endpoint returns.
|
||||||
|
Audio is preserved in Journal/Media/ on every terminal path; failed and empty-transcript
|
||||||
|
captures still produce a markdown record in Journal/Captures/ with a status field."""
|
||||||
import requests as req_lib
|
import requests as req_lib
|
||||||
nc_auth = (nextcloud_user, nextcloud_password)
|
nc_auth = (nextcloud_user, nextcloud_password)
|
||||||
|
    month_dir = timestamp[:7]
    audio_ext = os.path.splitext(tmp_path)[1] or ".webm"
    audio_filename = f"{timestamp}-voice{audio_ext}"
    audio_relpath = f"Journal/Media/{month_dir}/{audio_filename}"

    def archive_audio() -> bool:
        try:
            with open(tmp_path, "rb") as f:
                audio_bytes = f.read()
            media_parent = f"{nextcloud_url}/remote.php/dav/files/{nextcloud_user}/Journal/Media"
            media_dir = f"{media_parent}/{month_dir}"
            # MKCOL status is deliberately ignored: an existing collection is not an error.
            req_lib.request("MKCOL", media_parent, auth=nc_auth, timeout=10)
            req_lib.request("MKCOL", media_dir, auth=nc_auth, timeout=10)
            resp = req_lib.put(f"{media_dir}/{audio_filename}", data=audio_bytes, auth=nc_auth, timeout=60)
            # requests does not raise on 4xx/5xx; without this a rejected upload would count as archived.
            resp.raise_for_status()
            return True
        except Exception as e:
            print(f"Audio archival failed for {timestamp}: {e}")
            return False
        finally:
            if os.path.exists(tmp_path):
                os.unlink(tmp_path)

    def write_capture(filename: str, content_md: str, status: str):
        captures_dir = f"{nextcloud_url}/remote.php/dav/files/{nextcloud_user}/Journal/Captures"
        try:
            req_lib.request("MKCOL", captures_dir, auth=nc_auth, timeout=10)
            resp = req_lib.put(f"{captures_dir}/{filename}", data=content_md.encode("utf-8"), auth=nc_auth, timeout=30)
            # Skip the notifications below when the server rejected the write.
            resp.raise_for_status()
        except Exception as e:
            print(f"Capture markdown write failed for {timestamp}: {e}")
            return
        try:
            payload = {"type": "capture_saved", "filename": filename, "timestamp": timestamp, "status": status}
            req_lib.post("http://localhost:8000/api/events/notify", json=payload, timeout=3)
            req_lib.post("http://localhost:8000/api/captures/events/notify", json=payload, timeout=3)
        except Exception:
            pass

    transcript = ""
    transcribe_error = None
||||||
try:
|
try:
|
||||||
segments, _ = whisper_model.transcribe(
|
segments, _ = whisper_model.transcribe(
|
||||||
tmp_path, language="en", vad_filter=True, beam_size=1, initial_prompt=WHISPER_PROMPT
|
tmp_path, language="en", vad_filter=True, beam_size=1, initial_prompt=WHISPER_PROMPT
|
||||||
)
|
)
|
||||||
transcript = " ".join(s.text.strip() for s in segments).strip()
|
transcript = " ".join(s.text.strip() for s in segments).strip()
|
||||||
os.unlink(tmp_path)
|
|
||||||
if not transcript:
|
|
||||||
print(f"Async transcription empty for {timestamp} — nothing saved")
|
|
||||||
return
|
|
||||||
filename = f"{timestamp}-voice.md"
|
|
||||||
content_md = f"# Capture — {timestamp}\n\n**type:** voice\n**modality:** audio\n**status:** unprocessed\n\n---\n\n{transcript}\n"
|
|
||||||
captures_dir = f"{nextcloud_url}/remote.php/dav/files/{nextcloud_user}/Journal/Captures"
|
|
||||||
req_lib.request("MKCOL", captures_dir, auth=nc_auth, timeout=10)
|
|
||||||
url = f"{captures_dir}/{filename}"
|
|
||||||
req_lib.put(url, data=content_md.encode("utf-8"), auth=nc_auth, timeout=30)
|
|
||||||
print(f"Async transcription saved: {filename}")
|
|
||||||
# Notify SSE clients that transcription is complete
|
|
||||||
try:
|
|
||||||
import requests as _req
|
|
||||||
_req.post("http://localhost:8000/api/events/notify", json={
|
|
||||||
"type": "capture_saved",
|
|
||||||
"filename": filename,
|
|
||||||
"timestamp": timestamp,
|
|
||||||
}, timeout=3)
|
|
||||||
_req.post("http://localhost:8000/api/captures/events/notify", json={
|
|
||||||
"type": "capture_saved",
|
|
||||||
"filename": filename,
|
|
||||||
"timestamp": timestamp,
|
|
||||||
}, timeout=3)
|
|
||||||
except Exception:
|
|
||||||
pass
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
if os.path.exists(tmp_path):
|
transcribe_error = str(e)
|
||||||
os.unlink(tmp_path)
|
|
||||||
print(f"Async transcription failed for {timestamp}: {e}")
|
audio_archived = archive_audio()
|
||||||
|
audio_line = f"**audio_path:** {audio_relpath}\n" if audio_archived else "**audio_archive_failed:** true\n"
|
||||||
|
|
||||||
|
if transcribe_error is not None:
|
||||||
|
filename = f"{timestamp}-voice-failed.md"
|
||||||
|
content_md = (
|
||||||
|
f"# Capture — {timestamp}\n\n"
|
||||||
|
f"**type:** voice\n**modality:** audio\n**status:** failed_transcription\n"
|
||||||
|
f"{audio_line}"
|
||||||
|
f"**error:** {transcribe_error}\n"
|
||||||
|
)
|
||||||
|
write_capture(filename, content_md, "failed_transcription")
|
||||||
|
print(f"Async transcription failed for {timestamp}: {transcribe_error}")
|
||||||
|
return
|
||||||
|
|
||||||
|
if not transcript:
|
||||||
|
filename = f"{timestamp}-voice-empty.md"
|
||||||
|
content_md = (
|
||||||
|
f"# Capture — {timestamp}\n\n"
|
||||||
|
f"**type:** voice\n**modality:** audio\n**status:** empty_transcript\n"
|
||||||
|
f"{audio_line}"
|
||||||
|
)
|
||||||
|
write_capture(filename, content_md, "empty_transcript")
|
||||||
|
print(f"Async transcription empty for {timestamp}: audio archived")
|
||||||
|
return
|
||||||
|
|
||||||
|
filename = f"{timestamp}-voice.md"
|
||||||
|
content_md = (
|
||||||
|
f"# Capture — {timestamp}\n\n"
|
||||||
|
f"**type:** voice\n**modality:** audio\n**status:** saved\n"
|
||||||
|
f"{audio_line}\n---\n\n{transcript}\n"
|
||||||
|
)
|
||||||
|
write_capture(filename, content_md, "saved")
|
||||||
|
print(f"Async transcription saved: {filename}")
|
||||||
|
|
||||||
|
|
||||||
@app.post("/api/capture")
|
@app.post("/api/capture")
|
||||||
@@ -830,7 +1354,7 @@ Keep the full description to 150-250 words. Do not speculate beyond what is visi
|
|||||||
|
|
||||||
**type:** {capture_type}
|
**type:** {capture_type}
|
||||||
**modality:** {modality}
|
**modality:** {modality}
|
||||||
**status:** unprocessed
|
**status:** saved
|
||||||
**media:** {media_path}
|
**media:** {media_path}
|
||||||
{f"**project:** {project}" if project else ""}
|
{f"**project:** {project}" if project else ""}
|
||||||
|
|
||||||
|
|||||||
@@ -0,0 +1,128 @@
|
|||||||
"""One-off: backfill last_consolidated_at + consolidation_count on embeddings
from the dream-manifest-*.json files already in Journal/Dreams/.

Why this exists: the consolidation cursor columns added by the dreamer
redesign migration default to NULL / 0. Without history, the
underprocessed-count signal in dream_observation.observe_corpus() reports
"every chunk is underprocessed" (degenerate percentile), and NREM has no
basis to bias replay toward least-recently-consolidated chunks.

We have ~25 historical dream manifests in Nextcloud/Journal/Dreams/, each
listing the sources retrieved per stage. For each (manifest, source) pair
this script:
  - finds matching embeddings rows by source (basename match)
  - increments consolidation_count by 1
  - updates last_consolidated_at to the manifest date (UTC midnight)

Idempotent: re-running will not double-count because we drop existing
cursor values to NULL/0 before backfilling. Pass --dry-run to print what
would change without writing.
"""

import json
import os
import sys
from datetime import datetime, timezone
from pathlib import Path

from dotenv import load_dotenv
import psycopg2

load_dotenv(Path.home() / "aaronai" / ".env", override=True)

PG_DSN = os.getenv("PG_DSN")
DREAMS_DIR = Path("/home/aaron/nextcloud/data/data/aaron/files/Journal/Dreams")
DRY_RUN = "--dry-run" in sys.argv


def get_pg():
    return psycopg2.connect(PG_DSN)


def collect_manifest_records():
    """Return a list of (source_basename, manifest_date_utc) tuples from all
    dream-manifest-*.json files. One pair per (manifest, source) appearance."""
    pairs = []
    if not DREAMS_DIR.exists():
        return pairs
    for path in sorted(DREAMS_DIR.glob("dream-manifest-*.json")):
        try:
            m = json.loads(path.read_text())
        except Exception as e:
            print(f" skip {path.name}: {e}")
            continue
        date_str = m.get("date")
        if not date_str:
            continue
        try:
            dt = datetime.fromisoformat(date_str).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
        stages = m.get("stages") or {}
        for stage_name in ("nrem", "early_rem", "late_rem", "synthesis"):
            stage = stages.get(stage_name) or {}
            for src in (stage.get("sources") or []):
                if src:
                    pairs.append((src, dt))
    return pairs


def main():
    print(f"Mode: {'DRY-RUN' if DRY_RUN else 'APPLY'}")
    print(f"Scanning manifests in {DREAMS_DIR}")
    pairs = collect_manifest_records()
    print(f"Collected {len(pairs)} (source, manifest_date) pairs across all manifests")
    if not pairs:
        print("Nothing to backfill.")
        return

    # Aggregate per source: count + latest date
    from collections import defaultdict
    counts = defaultdict(int)
    latest = {}
    for src, dt in pairs:
        counts[src] += 1
        if src not in latest or dt > latest[src]:
            latest[src] = dt
    print(f"Unique sources to update: {len(counts)}")

    # Sample what we'd write
    print("Sample (top 5 by appearance count):")
    for src, n in sorted(counts.items(), key=lambda kv: -kv[1])[:5]:
        print(f" {n:>3} appearances — {src} → last_consolidated_at = {latest[src].date()}")

    if DRY_RUN:
        print("\nDry-run only. Re-run without --dry-run to apply.")
        return

    pg = get_pg()
    cur = pg.cursor()

    # Reset cursor for any sources we're about to backfill so reruns are clean.
    print("\nResetting cursor for sources we'll touch...")
    sources = list(counts.keys())
    cur.execute(
        "UPDATE embeddings SET last_consolidated_at = NULL, consolidation_count = 0 "
        "WHERE source = ANY(%s)",
        (sources,),
    )
    print(f" reset {cur.rowcount} embeddings rows")

    # Apply per-source updates. For each source, set count and latest date.
    print("Applying per-source backfill...")
    updated_rows = 0
    for src, n in counts.items():
        cur.execute(
            "UPDATE embeddings "
            "SET consolidation_count = %s, last_consolidated_at = %s "
            "WHERE source = %s",
            (n, latest[src], src),
        )
        updated_rows += cur.rowcount
    pg.commit()
    pg.close()
    print(f"Done. Updated {updated_rows} embeddings rows across {len(counts)} unique sources.")


if __name__ == "__main__":
    main()
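The backfill's counting rule can be checked on in-memory data before touching Postgres. The sketch below assumes the manifest shape the script reads (`date`, then `stages[<name>]["sources"]`); the helper names `pairs_from_manifest` and `aggregate` and the sample manifests are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

STAGES = ("nrem", "early_rem", "late_rem", "synthesis")


def pairs_from_manifest(manifest):
    """Yield one (source, manifest_date) pair per stage in which a source appears."""
    date_str = manifest.get("date")
    if not date_str:
        return []
    dt = datetime.fromisoformat(date_str).replace(tzinfo=timezone.utc)
    stages = manifest.get("stages") or {}
    return [
        (src, dt)
        for name in STAGES
        for src in ((stages.get(name) or {}).get("sources") or [])
        if src
    ]


def aggregate(pairs):
    """Return (appearance count per source, latest manifest date per source)."""
    counts = defaultdict(int)
    latest = {}
    for src, dt in pairs:
        counts[src] += 1
        if src not in latest or dt > latest[src]:
            latest[src] = dt
    return dict(counts), latest


# Placeholder manifests, not real dream output.
manifests = [
    {"date": "2026-04-01", "stages": {"nrem": {"sources": ["notes.md", "cv.pdf"]}}},
    {"date": "2026-04-03", "stages": {"nrem": {"sources": ["notes.md"]},
                                      "synthesis": {"sources": ["notes.md"]}}},
]
counts, latest = aggregate([p for m in manifests for p in pairs_from_manifest(m)])
print(counts, latest["notes.md"].date())
```

Note that a source retrieved in two stages of the same manifest is counted twice, so `consolidation_count` measures stage appearances rather than distinct nights.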
||||||
+327
-49
@@ -23,6 +23,7 @@ from datetime import datetime, timedelta
|
|||||||
from dotenv import load_dotenv
|
from dotenv import load_dotenv
|
||||||
import psycopg2
|
import psycopg2
|
||||||
import hashlib
|
import hashlib
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
|
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
|
||||||
|
|
||||||
@@ -42,6 +43,26 @@ NEXTCLOUD_USER = os.getenv("NEXTCLOUD_USER", "aaron")
|
|||||||
NEXTCLOUD_PASSWORD = os.getenv("NEXTCLOUD_PASSWORD", "")
|
NEXTCLOUD_PASSWORD = os.getenv("NEXTCLOUD_PASSWORD", "")
|
||||||
DREAMS_WEBDAV = f"{NEXTCLOUD_URL}/remote.php/dav/files/{NEXTCLOUD_USER}/Journal/Dreams"
|
DREAMS_WEBDAV = f"{NEXTCLOUD_URL}/remote.php/dav/files/{NEXTCLOUD_USER}/Journal/Dreams"
|
||||||
|
|
||||||
|
# ─── Retrieval-window config (per dreamer-multimodal-design.md §2) ─────────
# Biological grounding: NREM replays recent traces (24-72 hrs); REM links
# across time on structural similarity, not temporal proximity. Synthesis
# pulls from salience across the full corpus (no window). Spec calls for
# these to be mutable rather than hardcoded; this is the mutable home.
TIME_WINDOWS_HOURS = {
    "nrem": 72,            # 24-72 hrs, take wider end
    "early-rem": 24 * 30,  # 30 days
    "late-rem": 24 * 90,   # 90 days
    "lucid": None,         # no window
}

# Maximal Marginal Relevance: λ=1 → pure relevance, λ=0 → pure diversity.
# 0.5 weights the two equally; tune later if the dossier-cluster problem
# isn't sufficiently broken up.
MMR_LAMBDA = 0.5

# Fast/cheap model for query generation. Sonnet for synthesis (in synthesize_*).
LLM_QUERY_MODEL = os.getenv("DREAMER_QUERY_MODEL", "claude-haiku-4-5-20251001")

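For readers unfamiliar with the λ trade-off behind `MMR_LAMBDA`, here is a minimal greedy MMR sketch in pure Python. It is an assumption about how the constant is used, not the dreamer's actual retrieval code: `mmr_select` and `_cosine` are hypothetical names, and the toy vectors stand in for real embeddings.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, cand_vecs, k, lam=0.5):
    """Greedily pick k candidate indices, scoring each as
    lam * relevance - (1 - lam) * max similarity to anything already picked."""
    relevance = [_cosine(query_vec, c) for c in cand_vecs]
    selected = []
    remaining = list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max(
                (_cosine(cand_vecs[i], cand_vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but distinct.
query = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.12], [0.6, -0.8]]
print(mmr_select(query, candidates, 2, lam=1.0))  # pure relevance keeps both duplicates
print(mmr_select(query, candidates, 2, lam=0.5))  # balanced setting swaps in the distinct one
```

At λ=1 the two near-duplicates win on relevance alone. At λ=0.5 the second pick is penalized for resembling the first, which is the cluster-breaking effect the comment describes.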
||||||
# Similarity ranges calibrated for all-MiniLM-L6-v2
|
# Similarity ranges calibrated for all-MiniLM-L6-v2
|
||||||
MODE_RANGES = {
|
MODE_RANGES = {
|
||||||
"nrem": (0.48, 0.72),
|
"nrem": (0.48, 0.72),
|
||||||
@@ -289,36 +310,207 @@ def _get_embedder():
|
|||||||
from sentence_transformers import SentenceTransformer
|
from sentence_transformers import SentenceTransformer
|
||||||
return SentenceTransformer("all-MiniLM-L6-v2")
|
return SentenceTransformer("all-MiniLM-L6-v2")
|
||||||
|
|
||||||
def retrieve(mode, task=None, n_results=8, excluded_sources=None, type_filter=None):
|
def _llm_generate_queries(mode, signal, task=None, n_queries=4):
|
||||||
# E3 experiment: DREAMER_SUBSTRATE=graphiti routes retrieval to Graphiti /search
|
"""Park et al. 2023 reflection-style query generation. Feeds the LLM the
|
||||||
# Default behavior: pgvector similarity search (unchanged)
|
observation signal + a mode-specific framing; emits N retrieval queries
|
||||||
# type_filter is experimental and applies to pgvector retrieval only — Graphiti
|
that probe different corners of the recent corpus instead of the same
|
||||||
# facts are not embeddings rows and have no embeddings.type to filter on.
|
hardcoded string every night. Sources cited in dream_observation.py.
|
||||||
substrate = os.getenv("DREAMER_SUBSTRATE", "pgvector")
|
|
||||||
if substrate == "graphiti":
|
Falls back to recent_questions from the signal if the LLM call fails."""
|
||||||
return retrieve_graphiti(mode, task=task, n_results=n_results, excluded_sources=excluded_sources)
|
import anthropic
|
||||||
embedder = _get_embedder()
|
|
||||||
low, high = MODE_RANGES[mode]
|
|
||||||
|
|
||||||
if task:
|
if task:
|
||||||
query = task
|
# Lucid mode: decompose the user's task into sub-queries
|
||||||
elif mode == "late-rem":
|
prompt = (
|
||||||
delta = observe_corpus()
|
f"Decompose this user task into {n_queries} distinct sub-questions, "
|
||||||
topics = delta.get("recent_topics", [])
|
f"each suitable as a retrieval query against Aaron's personal corpus.\n\n"
|
||||||
query = topics[0] if topics else "practice place memory making"
|
f"TASK: {task}\n\n"
|
||||||
elif mode == "early-rem":
|
f'Output JSON ONLY: {{"queries": ["...", "...", ...]}}'
|
||||||
query = "career decision personal change what matters next"
|
)
|
||||||
else:
|
else:
|
||||||
query = "research fabrication teaching practice recent work"
|
mode_framings = {
|
||||||
|
"nrem": (
|
||||||
|
"NREM is replay-and-consolidation of RECENT traces. Generate queries "
|
||||||
|
"that probe what Aaron has been working on or capturing in the last "
|
||||||
|
"few days. Concrete entities — project names, course codes, named "
|
||||||
|
"subjects. The dreamer is re-touching specific recent material to "
|
||||||
|
"strengthen schema connections, not finding novel content."
|
||||||
|
),
|
||||||
|
"early-rem": (
|
||||||
|
"Early REM is associative bridging with emotional/personal register. "
|
||||||
|
"Generate queries that surface unresolved themes, career questions, "
|
||||||
|
"ongoing personal threads — material that connects intellectual and "
|
||||||
|
"emotional dimensions. Tone: thoughtful friend, not researcher."
|
||||||
|
),
|
||||||
|
"late-rem": (
|
||||||
|
"Late REM tests novel connections across DISTANT material. Generate "
|
||||||
|
"queries that pair concrete subjects from DIFFERENT domains of Aaron's "
|
||||||
|
"work (e.g., one from academic teaching, one from consulting, one from "
|
||||||
|
"creative practice) to probe for surprising structural similarity. "
|
||||||
|
"Cross-domain is required."
|
||||||
|
),
|
||||||
|
}
|
||||||
|
framing = mode_framings.get(mode, mode_framings["nrem"])
|
||||||
|
questions_snippet = "\n".join(
|
||||||
|
f" - {q[:200]}" for q in signal.get("recent_questions", [])[:8]
|
||||||
|
) or " (no recent user questions)"
|
||||||
|
journal_snippet = ", ".join(signal.get("new_journal_entries", [])[:5]) or "(none)"
|
||||||
|
days_str = (
|
||||||
|
f"{signal['days_since_dream']:.1f}"
|
||||||
|
if signal.get("days_since_dream") not in (None, float("inf"))
|
||||||
|
else "infinite (first dream)"
|
||||||
|
)
|
||||||
|
prompt = (
|
||||||
|
f"You generate retrieval queries for an Active Inference dreamer. The "
|
||||||
|
f"dreamer surfaces prediction errors — gaps between Aaron's model and "
|
||||||
|
f"reality — not summaries or generic associations.\n\n"
|
||||||
|
f"MODE: {mode}\n"
|
||||||
|
f"FRAMING: {framing}\n\n"
|
||||||
|
f"OBSERVATION SIGNAL:\n"
|
||||||
|
f"- Days since last dream: {days_str}\n"
|
||||||
|
f"- New chunks since last dream: {signal.get('new_chunks', 0)}\n"
|
||||||
|
f"- New journal entries: {journal_snippet}\n"
|
||||||
|
f"- Underprocessed chunks pool: {signal.get('underprocessed_count', 0):,}\n\n"
|
||||||
|
f"RECENT USER QUESTIONS (last 14 days, top 8):\n{questions_snippet}\n\n"
|
||||||
|
f"Generate {n_queries} retrieval queries. Requirements:\n"
|
||||||
|
f"- Use concrete entities, named projects, course codes, specific topics "
|
||||||
|
f"— NOT generic phrasing like 'research work practice'\n"
|
||||||
|
f"- Each query probes a DIFFERENT corner of recent activity\n"
|
||||||
|
f"- Match the {mode} framing\n"
|
||||||
|
f"- 5-15 words each\n\n"
|
||||||
|
f'Output JSON ONLY: {{"queries": ["...", "...", ...]}}'
|
||||||
|
)
|
||||||
|
|
||||||
embedding = embedder.encode([query]).tolist()[0]
|
try:
|
||||||
chunks = []
|
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
|
||||||
seen_sources = set()
|
resp = client.messages.create(
|
||||||
|
model=LLM_QUERY_MODEL,
|
||||||
|
max_tokens=512,
|
||||||
|
messages=[{"role": "user", "content": prompt}],
|
||||||
|
)
|
||||||
|
text = "".join(b.text for b in resp.content if hasattr(b, "text")).strip()
|
||||||
|
if text.startswith("```"):
|
||||||
|
text = text.split("```", 2)[1]
|
||||||
|
if text.startswith("json"):
|
||||||
|
text = text[4:]
|
||||||
|
text = text.strip()
|
||||||
|
data = json.loads(text)
|
||||||
|
queries = data.get("queries", [])
|
||||||
|
if isinstance(queries, list) and queries:
|
||||||
|
return [str(q).strip() for q in queries[:n_queries] if str(q).strip()]
|
||||||
|
except Exception as e:
|
||||||
|
print(f"[dream] LLM query generation failed ({e}); falling back to recent questions")
|
||||||
|
|
||||||
|
fallback = signal.get("recent_questions", [])[:n_queries] if signal else []
|
||||||
|
return fallback or [task or "recent activity decisions thinking"]
|
||||||
|
|
||||||
|
|
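The response handling in `_llm_generate_queries` (strip an optional Markdown fence, parse JSON, keep non-empty strings) can be exercised without an API call. The sketch below assumes the model returns text shaped like the prompt requests; `parse_queries` is a hypothetical name used only for this illustration.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this example readable


def parse_queries(text, n_queries=4):
    """Hypothetical helper mirroring the parsing steps in the diff."""
    text = text.strip()
    if text.startswith(FENCE):
        # Keep only what sits between the first pair of fences.
        text = text.split(FENCE, 2)[1]
        if text.startswith("json"):
            text = text[4:]
        text = text.strip()
    data = json.loads(text)
    queries = data.get("queries", [])
    if not isinstance(queries, list):
        return []
    # Drop blank entries and cap at n_queries, as the diff does.
    return [str(q).strip() for q in queries[:n_queries] if str(q).strip()]


fenced = FENCE + 'json\n{"queries": ["studio course syllabus", " ", "consulting notes"]}\n' + FENCE
print(parse_queries(fenced))  # ['studio course syllabus', 'consulting notes']
```

Because blank entries are filtered after slicing, the result can contain fewer than `n_queries` items, which is why the caller still needs its fallback path.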
||||||
|
def _mmr_select(candidate_embeddings, query_embedding, n, lambda_=MMR_LAMBDA):
|
||||||
|
"""Maximal Marginal Relevance — greedy selection that balances relevance
|
||||||
|
against pairwise diversity. Carbonell & Goldstein 1998. Used to prevent
|
||||||
|
cluster lock-in (e.g., 8 dossier-narrative variants filling all 8 slots).
|
||||||
|
|
||||||
|
candidate_embeddings: (N, D) numpy array
|
||||||
|
query_embedding: (D,) numpy array
|
||||||
|
Returns: list of indices into candidate_embeddings, len ≤ n."""
|
||||||
|
if len(candidate_embeddings) == 0:
|
||||||
|
return []
|
||||||
|
n = min(n, len(candidate_embeddings))
|
||||||
|
cands = candidate_embeddings / (np.linalg.norm(candidate_embeddings, axis=1, keepdims=True) + 1e-9)
|
||||||
|
q = query_embedding / (np.linalg.norm(query_embedding) + 1e-9)
|
||||||
|
relevance = cands @ q
|
||||||
|
selected = []
|
||||||
|
remaining = list(range(len(cands)))
|
||||||
|
while len(selected) < n and remaining:
|
||||||
|
if not selected:
|
||||||
|
best = max(remaining, key=lambda i: relevance[i])
|
||||||
|
else:
|
||||||
|
sel = cands[selected]
|
||||||
|
scores = {
|
||||||
|
i: lambda_ * relevance[i] - (1 - lambda_) * float((cands[i] @ sel.T).max())
|
||||||
|
for i in remaining
|
||||||
|
}
|
||||||
|
best = max(scores, key=scores.get)
|
||||||
|
selected.append(best)
|
||||||
|
remaining.remove(best)
|
||||||
|
return selected
|
||||||
|
|
||||||
|
|
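To make the MMR trade-off concrete, here is a small pure-Python sketch of the same greedy rule (score = λ · relevance − (1 − λ) · max similarity to anything already selected). It uses plain lists instead of NumPy so it runs anywhere; the vectors are invented for illustration and `mmr_select` here is a simplified stand-in, not the function from the diff.

```python
import math


def _cos(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-9)


def mmr_select(cands, query, n, lambda_=0.5):
    """Greedy MMR over candidate vectors; returns indices into cands."""
    relevance = [_cos(c, query) for c in cands]
    selected, remaining = [], list(range(len(cands)))
    while remaining and len(selected) < n:
        def score(i):
            if not selected:
                return relevance[i]
            redundancy = max(_cos(cands[i], cands[j]) for j in selected)
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
# Candidates 0 and 1 are near-duplicates; candidate 2 points elsewhere.
cands = [[1.0, 0.1], [1.0, 0.12], [1.0, -1.0]]
print(mmr_select(cands, query, n=2))               # [0, 2]
print(mmr_select(cands, query, n=2, lambda_=1.0))  # [0, 1]
```

With λ = 0.5 the near-duplicate is passed over in favor of the less similar candidate; with λ = 1 the selection falls back to pure relevance ranking.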
||||||
|
def _bump_consolidation_cursor(chunks):
|
||||||
|
"""Increment consolidation_count + set last_consolidated_at=NOW() for each
|
||||||
|
source represented in chunks. Called from dream_pipeline after NREM
|
||||||
|
completes. Per sharp-wave-ripples biology, NREM does the actual
|
||||||
|
consolidation; REM is associative use, so we only bump on NREM."""
|
||||||
|
if not chunks:
|
||||||
|
return
|
||||||
|
sources = list({c["source"] for c in chunks if c.get("source")})
|
||||||
|
if not sources:
|
||||||
|
return
|
||||||
try:
|
try:
|
||||||
pg = get_pg()
|
pg = get_pg()
|
||||||
cur = pg.cursor()
|
cur = pg.cursor()
|
||||||
|
cur.execute(
|
||||||
|
"UPDATE embeddings "
|
||||||
|
"SET consolidation_count = consolidation_count + 1, "
|
||||||
|
" last_consolidated_at = NOW() "
|
||||||
|
"WHERE source = ANY(%s)",
|
||||||
|
(sources,),
|
||||||
|
)
|
||||||
|
pg.commit()
|
||||||
|
pg.close()
|
||||||
|
except Exception as e:
|
||||||
|
print(f"[dream] cursor bump failed (non-fatal): {e}")
|
||||||
|
|
||||||
|
|
||||||
|
def retrieve(mode, task=None, n_results=8, excluded_sources=None,
|
||||||
|
type_filter=None, signal=None):
|
||||||
|
"""Refactored retrieval — see dreamer-design-spec.md Stage 3 + the
|
||||||
|
external-literature prescription in birdai-dreamer-exclusion-finding-2026-05-02.md.
|
||||||
|
|
||||||
|
Changes from the prior hardcoded-query version:
|
||||||
|
- Queries are LLM-generated from the observation signal (Park et al.
|
||||||
|
reflection pattern) instead of fixed strings. Solves the "same 8 sources
|
||||||
|
every night" failure where fixed seeds locked into one neighborhood.
|
||||||
|
- Per-mode time windows (24-72hr NREM / 30d Early REM / 90d Late REM)
|
||||||
|
filter candidates before vector search. Spec calls for these to be
|
||||||
|
mutable; they live in TIME_WINDOWS_HOURS.
|
||||||
|
- NREM biases toward under-processed chunks (low consolidation_count).
|
||||||
|
Biologically motivated: sharp-wave ripples tag what to replay, not
|
||||||
|
uniform sampling.
|
||||||
|
- Multiple queries (4 by default) → over-fetch → MMR merge for
|
||||||
|
within-night diversity. Prevents cluster domination.
|
||||||
|
|
||||||
|
signal is the observation-signal dict from dream_observation.observe_corpus().
|
||||||
|
If None, observe_corpus is called inline (back-compat for ad-hoc invocation).
|
||||||
|
"""
|
||||||
|
# E3 substrate experiment unchanged
|
||||||
|
substrate = os.getenv("DREAMER_SUBSTRATE", "pgvector")
|
||||||
|
if substrate == "graphiti":
|
||||||
|
return retrieve_graphiti(mode, task=task, n_results=n_results,
|
||||||
|
excluded_sources=excluded_sources)
|
||||||
|
|
||||||
|
if signal is None:
|
||||||
|
from dream_observation import observe_corpus as _obs
|
||||||
|
signal = _obs()
|
||||||
|
|
||||||
|
queries = _llm_generate_queries(mode, signal, task=task, n_queries=4)
|
||||||
|
if not queries:
|
||||||
|
print(f"[dream:{mode}] no queries generated; bailing")
|
||||||
|
return []
|
||||||
|
print(f"[dream:{mode}] generated queries: {queries}")
|
||||||
|
|
||||||
|
embedder = _get_embedder()
|
||||||
excluded_sources = excluded_sources or set()
|
excluded_sources = excluded_sources or set()
|
||||||
|
window_hours = TIME_WINDOWS_HOURS.get(mode)
|
||||||
|
per_query_n = 12 # over-fetch for MMR
|
||||||
|
|
||||||
|
candidates = []
|
||||||
|
seen_ids = set()
|
||||||
|
try:
|
||||||
|
pg = get_pg()
|
||||||
|
cur = pg.cursor()
|
||||||
|
for q in queries:
|
||||||
|
q_emb = embedder.encode([q]).tolist()[0]
|
||||||
where, params = [], []
|
where, params = [], []
|
||||||
if excluded_sources:
|
if excluded_sources:
|
||||||
where.append("source NOT IN %s")
|
where.append("source NOT IN %s")
|
||||||
@@ -326,33 +518,85 @@ def retrieve(mode, task=None, n_results=8, excluded_sources=None, type_filter=No
|
|||||||
if type_filter:
|
if type_filter:
|
||||||
where.append("type = ANY(%s)")
|
where.append("type = ANY(%s)")
|
||||||
params.append(list(type_filter))
|
params.append(list(type_filter))
|
||||||
|
if window_hours is not None:
|
||||||
|
# created_at is TEXT (legacy); cast it. NULL created_at fails
|
||||||
|
# the comparison so legacy rows are excluded from windowed
|
||||||
|
# modes — correct: NULL means "indexed before cursor existed,"
|
||||||
|
# which by definition is older than any window.
|
||||||
|
where.append(
|
||||||
|
f"(created_at IS NOT NULL AND "
|
||||||
|
f"created_at::timestamptz > NOW() - INTERVAL '{int(window_hours)} hours')"
|
||||||
|
)
|
||||||
where_clause = ("WHERE " + " AND ".join(where)) if where else ""
|
where_clause = ("WHERE " + " AND ".join(where)) if where else ""
|
||||||
|
# NREM bias: order by consolidation_count ASC first (under-processed
|
||||||
|
# chunks sort first; vector distance only breaks ties). Other modes:
|
||||||
|
# vector distance only.
|
||||||
|
order_clause = (
|
||||||
|
"ORDER BY consolidation_count ASC, embedding <=> %s::vector"
|
||||||
|
if mode == "nrem"
|
||||||
|
else "ORDER BY embedding <=> %s::vector"
|
||||||
|
)
|
||||||
cur.execute(f"""
|
cur.execute(f"""
|
||||||
SELECT document, source, type, 1 - (embedding <=> %s::vector) as similarity
|
SELECT id, document, source, type, embedding,
|
||||||
|
1 - (embedding <=> %s::vector) as similarity
|
||||||
FROM embeddings
|
FROM embeddings
|
||||||
{where_clause}
|
{where_clause}
|
||||||
ORDER BY embedding <=> %s::vector
|
{order_clause}
|
||||||
LIMIT %s
|
LIMIT %s
|
||||||
""", [embedding, *params, embedding, n_results * 3])
|
""", [q_emb, *params, q_emb, per_query_n])
|
||||||
|
for row in cur.fetchall():
|
||||||
for doc, source, etype, similarity in cur.fetchall():
|
if row[0] in seen_ids:
|
||||||
if not (low <= similarity <= high):
|
|
||||||
continue
|
continue
|
||||||
if source in seen_sources:
|
seen_ids.add(row[0])
|
||||||
continue
|
emb = row[4]
|
||||||
chunks.append({
|
# pgvector returns embeddings as string "[...]" by default
|
||||||
"source": source or "unknown",
|
if isinstance(emb, str):
|
||||||
"content": doc,
|
emb = np.array([float(x) for x in emb.strip("[]").split(",")])
|
||||||
"relevance": similarity,
|
else:
|
||||||
"similarity": similarity,
|
emb = np.array(emb)
|
||||||
"type": etype,
|
candidates.append({
|
||||||
|
"id": row[0],
|
||||||
|
"content": row[1],
|
||||||
|
"source": row[2] or "unknown",
|
||||||
|
"type": row[3],
|
||||||
|
"embedding": emb,
|
||||||
|
"similarity": float(row[5]),
|
||||||
})
|
})
|
||||||
seen_sources.add(source)
|
|
||||||
if len(chunks) >= n_results:
|
|
||||||
break
|
|
||||||
pg.close()
|
pg.close()
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
print(f"pgvector retrieval error: {e}")
|
import traceback
|
||||||
|
print(f"[dream:{mode}] retrieval SQL error: {e}")
|
||||||
|
traceback.print_exc()
|
||||||
|
return []
|
||||||
|
|
||||||
|
if not candidates:
|
||||||
|
print(f"[dream:{mode}] zero candidates after filters")
|
||||||
|
return []
|
||||||
|
|
||||||
|
# MMR over the union, using the first query as pivot for the relevance term.
|
||||||
|
# Averaging query embeddings would be theoretically cleaner but adds
|
||||||
|
# complexity for marginal benefit at this scale.
|
||||||
|
pivot_emb = np.array(embedder.encode([queries[0]]).tolist()[0])
|
||||||
|
cand_embs = np.array([c["embedding"] for c in candidates])
|
||||||
|
selected_idx = _mmr_select(cand_embs, pivot_emb, n=n_results * 2)
|
||||||
|
|
||||||
|
# Post-MMR source-level dedup (multi-chunk same source collapses to one).
|
||||||
|
chunks = []
|
||||||
|
seen_sources = set()
|
||||||
|
for i in selected_idx:
|
||||||
|
c = candidates[i]
|
||||||
|
if c["source"] in seen_sources:
|
||||||
|
continue
|
||||||
|
seen_sources.add(c["source"])
|
||||||
|
chunks.append({
|
||||||
|
"source": c["source"],
|
||||||
|
"content": c["content"],
|
||||||
|
"relevance": c["similarity"],
|
||||||
|
"similarity": c["similarity"],
|
||||||
|
"type": c["type"],
|
||||||
|
})
|
||||||
|
if len(chunks) >= n_results:
|
||||||
|
break
|
||||||
|
|
||||||
return chunks
|
return chunks
|
||||||
|
|
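The effect of the two `ORDER BY` variants in `retrieve` can be seen with a handful of made-up rows. This is only a sketch of the sort semantics in plain Python, with hypothetical ids and values; it does not touch the database.

```python
# Hypothetical rows: (id, consolidation_count, cosine_distance_to_query)
rows = [
    ("a", 3, 0.10),
    ("b", 0, 0.45),
    ("c", 0, 0.30),
    ("d", 1, 0.05),
]

# Same ordering as: ORDER BY consolidation_count ASC, distance ASC (NREM)
nrem_order = sorted(rows, key=lambda r: (r[1], r[2]))
# Same ordering as: ORDER BY distance ASC (other modes)
plain_order = sorted(rows, key=lambda r: r[2])

print([r[0] for r in nrem_order])   # ['c', 'b', 'd', 'a']
print([r[0] for r in plain_order])  # ['d', 'a', 'c', 'b']
```

In the NREM ordering, the never-consolidated rows come first even though "d" is the closest match, so with a small `LIMIT` the result is driven mostly by `consolidation_count` and only secondarily by similarity.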
||||||
@@ -496,6 +740,12 @@ def dream_pipeline(type_filter=None):
|
|||||||
"""
|
"""
|
||||||
Full nightly pipeline — interdependent stages.
|
Full nightly pipeline — interdependent stages.
|
||||||
NREM output feeds Early REM. Both feed Late REM. All three feed Synthesis.
|
NREM output feeds Early REM. Both feed Late REM. All three feed Synthesis.
|
||||||
|
|
||||||
|
Per dreamer-design-spec.md, this now runs Stage 1 (observe) and Stage 2
|
||||||
|
(select) first. If select_mode returns None — corpus unchanged and no new
|
||||||
|
journal entry — the dreamer goes quiet rather than manufacturing novelty.
|
||||||
|
Otherwise NREM/Early-REM/Late-REM run with LLM-generated queries seeded
|
||||||
|
from the observation signal.
|
||||||
"""
|
"""
|
||||||
print(f"Dreamer pipeline starting — {datetime.now().strftime('%Y-%m-%d %H:%M')}")
|
print(f"Dreamer pipeline starting — {datetime.now().strftime('%Y-%m-%d %H:%M')}")
|
||||||
|
|
||||||
@@ -503,21 +753,47 @@ def dream_pipeline(type_filter=None):
|
|||||||
state.pop("retrieved_sources", None) # legacy key; session-scoped novelty now
|
state.pop("retrieved_sources", None) # legacy key; session-scoped novelty now
|
||||||
session_retrieved = set()
|
session_retrieved = set()
|
||||||
|
|
||||||
delta = observe_corpus()
|
# ── Stage 1 + 2: Observe + Select ──────────────────────────────────────
|
||||||
print(f"Corpus: {delta['new_chunks']} new chunks, {delta['days_since_dream']:.1f} days since last dream")
|
from dream_observation import observe_corpus as _obs, select_mode as _select
|
||||||
print("Novelty: session-scoped (no across-night exclusion)")
|
signal = _obs()
|
||||||
|
print(
|
||||||
|
f"Signal: new_chunks={signal['new_chunks']}, "
|
||||||
|
f"new_journal={len(signal['new_journal_entries'])}, "
|
||||||
|
f"days_since={signal['days_since_dream']:.1f}, "
|
||||||
|
f"underprocessed={signal['underprocessed_count']:,}"
|
||||||
|
)
|
||||||
|
selected = _select(signal)
|
||||||
|
if selected is None:
|
||||||
|
print("[select_mode] None — nothing worth dreaming about tonight (going quiet)")
|
||||||
|
# Record last_select_quiet_at but leave last_dream untouched, so a caller can distinguish
|
||||||
|
# an actual dream from a skipped night by looking at last_dream_file or
|
||||||
|
# checking the manifest dir.
|
||||||
|
state["last_select_quiet_at"] = datetime.now().isoformat()
|
||||||
|
save_dreamer_state(state)
|
||||||
|
return None
|
||||||
|
print(f"[select_mode] → {selected}")
|
||||||
|
|
||||||
# ── Stage 1: NREM ──────────────────────────────────────────────────────
|
# The pipeline always runs all three modes for the manifest's continuity.
|
||||||
|
# select_mode's choice signals the *primary* focus; the others still run
|
||||||
|
# but draw from their own mode-appropriate windows.
|
||||||
|
primary_mode = selected
|
||||||
|
|
||||||
|
# ── Stage 3: NREM ──────────────────────────────────────────────────────
|
||||||
print("\n[NREM] Retrieving...")
|
print("\n[NREM] Retrieving...")
|
||||||
# NREM is replay-and-consolidation — does not exclude prior traces.
|
# NREM is replay-and-consolidation — does not exclude prior traces.
|
||||||
# Late REM and Early REM exclude prior content for novelty; NREM does not.
|
# Late REM and Early REM exclude prior content for novelty; NREM does not.
|
||||||
nrem_chunks = retrieve("nrem", excluded_sources=None, type_filter=type_filter)
|
nrem_chunks = retrieve("nrem", excluded_sources=None,
|
||||||
|
type_filter=type_filter, signal=signal)
|
||||||
session_retrieved.update(c["source"] for c in nrem_chunks)
|
session_retrieved.update(c["source"] for c in nrem_chunks)
|
||||||
# Track sources that scored above Early REM ceiling — these are the only ones Early REM should exclude
|
# Track sources that scored above Early REM ceiling — these are the only ones Early REM should exclude
|
||||||
nrem_high_sources = {c["source"] for c in nrem_chunks if c["similarity"] > 0.55}
|
nrem_high_sources = {c["source"] for c in nrem_chunks if c["similarity"] > 0.55}
|
||||||
if not nrem_chunks:
|
if not nrem_chunks:
|
||||||
print("[NREM] No suitable chunks — aborting pipeline")
|
print("[NREM] No suitable chunks — aborting pipeline")
|
||||||
return None
|
return None
|
||||||
|
# Cursor bump: NREM is the consolidation stage. Each appearance increments
|
||||||
|
# consolidation_count + updates last_consolidated_at, so the next dream's
|
||||||
|
# observation sees these sources as less under-processed.
|
||||||
|
_bump_consolidation_cursor(nrem_chunks)
|
||||||
|
|
||||||
print(f"[NREM] Retrieved {len(nrem_chunks)} chunks. Synthesizing...")
|
print(f"[NREM] Retrieved {len(nrem_chunks)} chunks. Synthesizing...")
|
||||||
nrem_output = synthesize_nrem(nrem_chunks)
|
nrem_output = synthesize_nrem(nrem_chunks)
|
||||||
@@ -528,7 +804,7 @@ def dream_pipeline(type_filter=None):
|
|||||||
"nrem": {
|
"nrem": {
|
||||||
"chunks_retrieved": len(nrem_chunks),
|
"chunks_retrieved": len(nrem_chunks),
|
||||||
"avg_similarity": round(sum(c["relevance"] for c in nrem_chunks) / len(nrem_chunks), 3),
|
"avg_similarity": round(sum(c["relevance"] for c in nrem_chunks) / len(nrem_chunks), 3),
|
||||||
"query": "research fabrication teaching practice recent work",
|
"query": "[llm-generated from observation signal]",
|
||||||
"word_count": len(nrem_output.split()),
|
"word_count": len(nrem_output.split()),
|
||||||
"sources": nrem_sources,
|
"sources": nrem_sources,
|
||||||
"distinct_folders": nrem_folders,
|
"distinct_folders": nrem_folders,
|
||||||
@@ -546,7 +822,8 @@ def dream_pipeline(type_filter=None):
|
|||||||
print("\n[Early REM] Retrieving...")
|
print("\n[Early REM] Retrieving...")
|
||||||
# Early REM excludes previously retrieved + NREM high-scorers only (not full session_retrieved)
|
# Early REM excludes previously retrieved + NREM high-scorers only (not full session_retrieved)
|
||||||
# Sources that scored in Early REM band during NREM remain available
|
# Sources that scored in Early REM band during NREM remain available
|
||||||
early_chunks = retrieve("early-rem", excluded_sources=nrem_high_sources, type_filter=type_filter)
|
early_chunks = retrieve("early-rem", excluded_sources=nrem_high_sources,
|
||||||
|
type_filter=type_filter, signal=signal)
|
||||||
session_retrieved.update(c["source"] for c in early_chunks)
|
session_retrieved.update(c["source"] for c in early_chunks)
|
||||||
if not early_chunks:
|
if not early_chunks:
|
||||||
print("[Early REM] No suitable chunks — skipping")
|
print("[Early REM] No suitable chunks — skipping")
|
||||||
@@ -560,7 +837,7 @@ def dream_pipeline(type_filter=None):
|
|||||||
stage_data["early_rem"] = {
|
stage_data["early_rem"] = {
|
||||||
"chunks_retrieved": len(early_chunks),
|
"chunks_retrieved": len(early_chunks),
|
||||||
"avg_similarity": round(sum(c["relevance"] for c in early_chunks) / len(early_chunks), 3),
|
"avg_similarity": round(sum(c["relevance"] for c in early_chunks) / len(early_chunks), 3),
|
||||||
"query": "career decision personal change what matters next",
|
"query": "[llm-generated from observation signal]",
|
||||||
"word_count": len(early_rem_output.split()),
|
"word_count": len(early_rem_output.split()),
|
||||||
"sources": early_sources,
|
"sources": early_sources,
|
||||||
"distinct_folders": early_folders,
|
"distinct_folders": early_folders,
|
||||||
@@ -572,7 +849,8 @@ def dream_pipeline(type_filter=None):
|
|||||||
|
|
||||||
# ── Stage 3: Late REM — informed by NREM + Early REM ──────────────────
|
# ── Stage 3: Late REM — informed by NREM + Early REM ──────────────────
|
||||||
print("\n[Late REM] Retrieving...")
|
print("\n[Late REM] Retrieving...")
|
||||||
late_chunks = retrieve("late-rem", excluded_sources=session_retrieved, type_filter=type_filter)
|
late_chunks = retrieve("late-rem", excluded_sources=session_retrieved,
|
||||||
|
type_filter=type_filter, signal=signal)
|
||||||
session_retrieved.update(c["source"] for c in late_chunks)
|
session_retrieved.update(c["source"] for c in late_chunks)
|
||||||
if not late_chunks:
|
if not late_chunks:
|
||||||
print("[Late REM] No suitable chunks — skipping")
|
print("[Late REM] No suitable chunks — skipping")
|
||||||
@@ -591,7 +869,7 @@ def dream_pipeline(type_filter=None):
|
|||||||
stage_data["late_rem"] = {
|
stage_data["late_rem"] = {
|
||||||
"chunks_retrieved": len(late_chunks),
|
"chunks_retrieved": len(late_chunks),
|
||||||
"avg_similarity": round(sum(c["relevance"] for c in late_chunks) / len(late_chunks), 3),
|
"avg_similarity": round(sum(c["relevance"] for c in late_chunks) / len(late_chunks), 3),
|
||||||
"query": "practice place memory making",
|
"query": "[llm-generated from observation signal]",
|
||||||
"word_count": len(late_rem_output.split()),
|
"word_count": len(late_rem_output.split()),
|
||||||
"sources": late_sources,
|
"sources": late_sources,
|
||||||
"distinct_folders": list(set(late_folders)),
|
"distinct_folders": list(set(late_folders)),
|
||||||
|
|||||||
@@ -0,0 +1,235 @@
|
|||||||
|
"""
|
||||||
|
Dreamer Stages 1 + 2 — Observe and Select.
|
||||||
|
|
||||||
|
Implements `dreamer-design-spec.md`'s Stage 1 (observe_corpus) and Stage 2
|
||||||
|
(select_mode). These have been latent in dream.py — observe_corpus existed
|
||||||
|
in skeletal form but its output was largely unused; select_mode did not
|
||||||
|
exist at all. The dreamer always ran all stages with hardcoded queries.
|
||||||
|
|
||||||
|
Per spec (lines 27–34 of dreamer-design-spec.md):
|
||||||
|
delta = observe_corpus()
|
||||||
|
selected_mode = select_mode(delta, task, project)
|
||||||
|
if selected_mode is None:
|
||||||
|
return # nothing worth dreaming
|
||||||
|
|
||||||
|
The "returns None — dreamer goes quiet rather than manufacturing novelty"
|
||||||
|
semantics (spec line 67) is the canonical answer to the repetition problem
|
||||||
|
documented in birdai-dreamer-exclusion-finding-2026-05-02.md.
|
||||||
|
|
||||||
|
Grounded in:
|
||||||
|
- Active Inference (Friston 2010, 2017) — observe error, choose action that
|
||||||
|
minimizes free energy. The dreamer is a prediction-error machine; observe
|
||||||
|
what's diverged from the model, dream about that.
|
||||||
|
- Sleep stages (Stickgold 2005; Walker 2017; Diekelmann & Born 2010) — NREM
|
||||||
|
for replay of new traces, REM for associative cross-cluster integration.
|
||||||
|
- Sharp-wave ripples (Buzsáki, Wilson) — biology tags WHAT to replay
|
||||||
|
(under-processed chunks); not uniform. Implemented via the consolidation
|
||||||
|
cursor on the embeddings table.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import sqlite3
|
||||||
|
from datetime import datetime, timedelta
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from dotenv import load_dotenv
|
||||||
|
import psycopg2
|
||||||
|
|
||||||
|
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
|
||||||
|
|
||||||
|
# ─── Paths ──────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
PG_DSN = os.getenv("PG_DSN")
|
||||||
|
CONVERSATIONS_DB = str(Path.home() / "aaronai" / "conversations.db")
|
||||||
|
WATCHER_STATE = str(Path.home() / "aaronai" / "watcher_state.json")
|
||||||
|
DREAMER_STATE = str(Path.home() / "aaronai" / "dreamer_state.json")
|
||||||
|
JOURNAL_DAILY = "/home/aaron/nextcloud/data/data/aaron/files/Journal/Daily"
|
||||||
|
|
||||||
|
# ─── Thresholds ─────────────────────────────────────────────────────────────
|
||||||
|
# Per spec, these become settings-panel controls eventually. For now they're
|
||||||
|
# constants here; moving them to a config module is task #48.
|
||||||
|
|
||||||
|
NEW_CHUNK_THRESHOLD = 5 # below this, NREM not warranted on novelty alone
|
||||||
|
STALENESS_TRIGGER_DAYS = 3 # corpus quiet ≥3 days → Late REM ("shake things loose")
|
||||||
|
QUESTION_LOOKBACK_DAYS = 14 # spec line 61: "the last 14 days"
|
||||||
|
UNDERPROCESSED_PERCENTILE = 0.25 # bottom quartile of consolidation_count
|
||||||
|
|
||||||
|
|
||||||
|
# ─── Helpers ────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
def _get_pg():
|
||||||
|
return psycopg2.connect(PG_DSN)
|
||||||
|
|
||||||
|
|
||||||
|
def _load_json(path, default):
|
||||||
|
try:
|
||||||
|
return json.loads(Path(path).read_text())
|
||||||
|
except Exception:
|
||||||
|
return default
|
||||||
|
|
||||||
|
|
||||||
|
def _recent_user_questions(days=QUESTION_LOOKBACK_DAYS, limit=20):
|
||||||
|
"""Pull recent user-turn content from conversations.db. The spec calls
|
||||||
|
these 'live questions' — what Aaron has been asking about. They become
|
||||||
|
seed material for the REM modes."""
|
||||||
|
try:
|
||||||
|
conn = sqlite3.connect(CONVERSATIONS_DB)
|
||||||
|
cutoff = (datetime.now() - timedelta(days=days)).isoformat()
|
||||||
|
cur = conn.cursor()
|
||||||
|
cur.execute(
|
||||||
|
"""
|
||||||
|
SELECT m.content FROM messages m
|
||||||
|
JOIN conversations c ON m.conversation_id = c.id
|
||||||
|
WHERE m.role = 'user' AND c.updated_at > ?
|
||||||
|
ORDER BY m.timestamp DESC LIMIT ?
|
||||||
|
""",
|
||||||
|
(cutoff, limit),
|
||||||
|
)
|
||||||
|
rows = cur.fetchall()
|
||||||
|
conn.close()
|
||||||
|
return [r[0][:280] for r in rows]
|
||||||
|
except Exception:
|
||||||
|
return []
|
||||||
|
|
||||||
|
|
||||||
|
def _new_journal_entries(since_ts):
|
||||||
|
"""Files in Journal/Daily/ created or modified since the last dream.
|
||||||
|
Journal entries with emotional/personal register route to Early REM per
|
||||||
|
the spec (line 71)."""
|
||||||
|
journal_path = Path(JOURNAL_DAILY)
|
||||||
|
if not journal_path.exists():
|
||||||
|
return []
|
||||||
|
new = []
|
||||||
|
for p in journal_path.rglob("*.md"):
|
||||||
|
try:
|
||||||
|
if p.stat().st_mtime > since_ts:
|
||||||
|
new.append(str(p.relative_to(journal_path)))
|
||||||
|
except OSError:
|
||||||
|
continue
|
||||||
|
return new
|
||||||
|
|
||||||
|
|
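The mtime comparison used by `_new_journal_entries` can be tried against a throwaway directory. The sketch below is a hypothetical stand-in (`modified_since` is not a name from the diff) and uses `os.utime` to fake file ages.

```python
import os
import tempfile
from pathlib import Path


def modified_since(root, since_ts, pattern="*.md"):
    """Hypothetical stand-in for the journal scan: relative paths of files
    under root whose mtime is newer than since_ts."""
    root = Path(root)
    found = []
    for p in root.rglob(pattern):
        try:
            if p.stat().st_mtime > since_ts:
                found.append(str(p.relative_to(root)))
        except OSError:
            continue  # file vanished or is unreadable; skip it
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "old.md"
    new = Path(tmp) / "new.md"
    old.write_text("older entry")
    new.write_text("newer entry")
    os.utime(old, (1_000, 1_000))  # pretend this was written long ago
    os.utime(new, (5_000, 5_000))
    result = modified_since(tmp, since_ts=2_000)
    print(result)  # ['new.md']
```

Note that with `since_ts = 0` (no dream recorded yet) every file qualifies, so the first run would see the whole journal as new.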
||||||
|
def _new_chunks_count(since_ts):
|
||||||
|
"""Files in the watcher state with mtime > last_dream. The spec calls
|
||||||
|
this 'what changed' (line 58). Used as the NREM novelty signal."""
|
||||||
|
state = _load_json(WATCHER_STATE, {})
|
||||||
|
count = 0
|
||||||
|
for _path, mtime in state.items():
|
||||||
|
try:
|
||||||
|
if float(mtime) > since_ts:
|
||||||
|
count += 1
|
||||||
|
except (ValueError, TypeError):
|
||||||
|
continue
|
||||||
|
return count
|
||||||
|
|
||||||
|
|
||||||
|
def _underprocessed_chunk_count():
|
||||||
|
"""Chunks below the underprocessed percentile by consolidation_count.
|
||||||
|
Biologically motivated: sharp-wave ripples bias replay toward novel /
|
||||||
|
under-encoded experience, not uniform sampling. We give NREM a pool of
|
||||||
|
'least-replayed' chunks to draw from in Stage 3."""
|
||||||
|
try:
|
||||||
|
pg = _get_pg()
|
||||||
|
cur = pg.cursor()
|
||||||
|
cur.execute(
|
||||||
|
"""
|
||||||
|
WITH t AS (
|
||||||
|
SELECT percentile_cont(%s) WITHIN GROUP (ORDER BY consolidation_count)
|
||||||
|
AS threshold
|
||||||
|
FROM embeddings
|
||||||
|
)
|
||||||
|
SELECT COUNT(*) FROM embeddings, t
|
||||||
|
WHERE consolidation_count <= t.threshold
|
||||||
|
""",
|
||||||
|
(UNDERPROCESSED_PERCENTILE,),
|
||||||
|
)
|
||||||
|
result = cur.fetchone()[0]
|
||||||
|
pg.close()
|
||||||
|
return int(result or 0)
|
||||||
|
except Exception:
|
||||||
|
return 0
|
||||||
|
|
||||||
|
|
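For intuition about the threshold query in `_underprocessed_chunk_count`, here is a pure-Python sketch of a linearly interpolated percentile, which is how PostgreSQL documents `percentile_cont`. The counts are invented, and the Python function is an approximation for illustration rather than a drop-in replacement for the SQL.

```python
import math


def percentile_cont(values, fraction):
    """Linear-interpolation percentile over a list of numbers."""
    ordered = sorted(values)
    if not ordered:
        return None
    pos = fraction * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Hypothetical consolidation_count values for eight chunks.
counts = [0, 0, 0, 1, 2, 2, 5, 9]
threshold = percentile_cont(counts, 0.25)
underprocessed = sum(1 for c in counts if c <= threshold)
print(threshold, underprocessed)  # 0.0 3
```

Because the comparison is `<=`, ties at the threshold are all counted: here three of eight chunks qualify, more than a strict 25%. On a corpus where most chunks have never been consolidated, the count can approach the size of the whole table.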
||||||
|
# ─── Stage 1: observe_corpus ────────────────────────────────────────────────
|
||||||
|
|
||||||
|
def observe_corpus():
|
||||||
|
"""Build the signal vector consumed by select_mode and (downstream) by
|
||||||
|
retrieve. Concrete observations only — no interpretation. Each key is
|
||||||
|
a direct measurement from the corpus, watcher, journal, or conversation
|
||||||
|
log.
|
||||||
|
|
||||||
|
Returns a dict with:
|
||||||
|
now_ts -- current Unix timestamp
|
||||||
|
last_dream_ts -- last completed dream timestamp (0 if never)
|
||||||
|
days_since_dream -- float; inf if never dreamed
|
||||||
|
new_chunks -- count of files newer than last_dream
|
||||||
|
new_journal_entries -- list of Journal/Daily/*.md filenames since last_dream
|
||||||
|
recent_questions -- user-turn content from last 14 days
|
||||||
|
underprocessed_count -- chunks in the bottom 25% by consolidation_count
|
||||||
|
"""
|
||||||
|
state = _load_json(DREAMER_STATE, {})
|
||||||
|
last_dream_ts = float(state.get("last_dream_timestamp", 0) or 0)
|
||||||
|
now_ts = datetime.now().timestamp()
|
||||||
|
|
||||||
|
return {
|
||||||
|
"now_ts": now_ts,
|
||||||
|
"last_dream_ts": last_dream_ts,
|
||||||
|
"days_since_dream": (now_ts - last_dream_ts) / 86400 if last_dream_ts else float("inf"),
|
||||||
|
"new_chunks": _new_chunks_count(last_dream_ts),
|
||||||
|
"new_journal_entries": _new_journal_entries(last_dream_ts),
|
||||||
|
"recent_questions": _recent_user_questions(),
|
||||||
|
"underprocessed_count": _underprocessed_chunk_count(),
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
# ─── Stage 2: select_mode ───────────────────────────────────────────────────
|
||||||
|
|
||||||
|
def select_mode(signal, task=None, explicit_mode=None):
|
||||||
|
"""Return one of {'nrem', 'early-rem', 'late-rem', 'lucid'}. Never None.
|
||||||
|
|
||||||
|
The dreamer fires every scheduled night. The earlier "go quiet on null
|
||||||
|
delta" rule was a synthesis-doc invention that didn't match the actual
|
||||||
|
desired UX — the original dreamer always dreamed, even if it repeated
|
||||||
|
itself. The cure for repetition lives in the retrieve layer
|
||||||
|
(LLM-generated queries from the observation signal, MMR diversity,
|
||||||
|
cursor bias toward under-processed chunks), not in skipping nights.
|
||||||
|
|
||||||
|
Routing logic:
|
||||||
|
- explicit_mode argument wins
|
||||||
|
- task supplied → 'lucid' (question-anchored)
|
||||||
|
- days_since_dream ≥ STALENESS_TRIGGER_DAYS → 'late-rem' (shake loose
|
||||||
|
via cross-domain pairs when nothing's been added in a while)
|
||||||
|
- new journal entry → 'early-rem' (emotional/personal register)
|
||||||
|
- default → 'nrem' (replay-and-consolidation; always has something to
|
||||||
|
do because the corpus always has under-processed chunks)
|
||||||
|
"""
|
||||||
|
if explicit_mode:
|
||||||
|
return explicit_mode
|
||||||
|
if task:
|
||||||
|
return "lucid"
|
||||||
|
|
||||||
|
days_since = signal["days_since_dream"]
|
||||||
|
new_journal = signal["new_journal_entries"]
|
||||||
|
|
||||||
|
if days_since >= STALENESS_TRIGGER_DAYS:
|
||||||
|
return "late-rem"
|
||||||
|
|
||||||
|
if new_journal:
|
||||||
|
return "early-rem"
|
||||||
|
|
||||||
|
return "nrem"
|
||||||
|
|
||||||
|
|
||||||
|
# ─── CLI for manual inspection ──────────────────────────────────────────────
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
signal = observe_corpus()
|
||||||
|
short = {k: v for k, v in signal.items() if k != "recent_questions"}
|
||||||
|
print("Signal (excluding recent_questions):")
|
||||||
|
print(json.dumps(short, indent=2, default=str))
|
||||||
|
print(f"\nRecent user questions ({len(signal['recent_questions'])}):")
|
||||||
|
for q in signal["recent_questions"][:5]:
|
||||||
|
print(f" - {q[:140]}")
|
||||||
|
mode = select_mode(signal)
|
||||||
|
print(f"\nselect_mode() → {mode!r}")
|
||||||
+189
-35
@@ -1,17 +1,20 @@
|
|||||||
"""
|
"""
|
||||||
Aaron AI Stage 1 encoding helpers — single canonical implementation of:
|
Aaron AI Stage 1 encoding helpers — single canonical implementation of:
|
||||||
- extract_text(filepath) — four-extension text extraction
|
- extract_blocks(filepath) — section-aware extraction (docx heading-bounded
|
||||||
- chunk_text(text, chunk_size, overlap) — word-based chunking
|
sections, pptx per-slide, pdf/txt/md single-block)
|
||||||
- chunk_and_embed(text, source, embedder, filepath, folder) — produce ready-to-write rows
|
- extract_text(filepath) — back-compat string concatenation over blocks
|
||||||
|
- chunk_text(text, chunk_size, overlap) — word-based blind chunking
|
||||||
|
- chunk_and_embed(text_or_blocks, source, embedder, filepath, folder) —
|
||||||
|
produce ready-to-write rows. Accepts str (blind) or list[dict] (section-aware).
|
||||||
- write_embeddings_batch(conn, batch) — server-side NOW() canonical INSERT
|
- write_embeddings_batch(conn, batch) — server-side NOW() canonical INSERT
|
||||||
|
|
||||||
Used by watcher.py, ingest.py, corpus_integrity.py, and api.py /api/corpus/retry.
|
Used by watcher.py, ingest.py, corpus_integrity.py, and api.py /api/corpus/retry.
|
||||||
Replaces four separate extract reimplementations and two extract-chunk-embed paths.
|
|
||||||
"""
|
"""
|
||||||
|
|
||||||
import hashlib
|
import hashlib
|
||||||
import json
|
import json
|
||||||
import logging
|
import logging
|
||||||
|
import re
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
from docx import Document as DocxDocument
|
from docx import Document as DocxDocument
|
||||||
@@ -24,6 +27,62 @@ SUPPORTED = {".docx", ".pdf", ".pptx", ".txt", ".md"}
|
|||||||
DEFAULT_CHUNK_SIZE = 500
|
DEFAULT_CHUNK_SIZE = 500
|
||||||
DEFAULT_CHUNK_OVERLAP = 50
|
DEFAULT_CHUNK_OVERLAP = 50
|
||||||
|
|
||||||
|
_BOLD_KV_RE = re.compile(r"^\*\*[\w +/-]+?:\*\*")
|
||||||
|
|
||||||
|
|
||||||
|
def _strip_md_frontmatter(text: str) -> str:
|
||||||
|
"""Strip a leading frontmatter block from markdown, if present.
|
||||||
|
|
||||||
|
Recognizes two formats:
|
||||||
|
- YAML-style: file's first non-empty line is `---`, terminated by `---`.
|
||||||
|
Only triggered when no heading precedes — guards against `---`
|
||||||
|
horizontal rules that follow an H1.
|
||||||
|
- Capture-style: optional H1 heading, then one or more `**key:** value`
|
||||||
|
lines (and blanks), terminated by `---`. The H1 is preserved; the
|
||||||
|
key/value block + separator are removed.
|
||||||
|
|
||||||
|
Body `---` rules and body `**bold:**` lines are never touched — the scan
|
||||||
|
aborts as soon as a non-frontmatter line appears in the leading block.
|
||||||
|
"""
|
||||||
|
lines = text.splitlines()
|
||||||
|
n = len(lines)
|
||||||
|
i = 0
|
||||||
|
while i < n and not lines[i].strip():
|
||||||
|
i += 1
|
||||||
|
heading = None
|
||||||
|
if i < n and lines[i].startswith("# "):
|
||||||
|
heading = lines[i]
|
||||||
|
i += 1
|
||||||
|
while i < n and not lines[i].strip():
|
||||||
|
i += 1
|
||||||
|
if i >= n:
|
||||||
|
return text
|
||||||
|
first = lines[i].strip()
|
||||||
|
if heading is None and first == "---":
|
||||||
|
j = i + 1
|
||||||
|
while j < n and lines[j].strip() != "---":
|
||||||
|
j += 1
|
||||||
|
if j >= n:
|
||||||
|
return text
|
||||||
|
body_start = j + 1
|
||||||
|
elif _BOLD_KV_RE.match(first):
|
||||||
|
j = i
|
||||||
|
while j < n:
|
||||||
|
s = lines[j].strip()
|
||||||
|
if not s or _BOLD_KV_RE.match(s):
|
||||||
|
j += 1
|
||||||
|
continue
|
||||||
|
if s == "---":
|
||||||
|
body_start = j + 1
|
||||||
|
break
|
||||||
|
return text
|
||||||
|
else:
|
||||||
|
return text
|
||||||
|
else:
|
||||||
|
return text
|
||||||
|
body = "\n".join(lines[body_start:]).lstrip("\n")
|
||||||
|
return f"{heading}\n\n{body}" if heading else body
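The three cases the docstring describes (YAML-style block, capture-style block, and a body `---` rule that must survive) can be checked against a runnable copy of the function. The sketch below restates the code from this diff with indentation restored by assumption, since the flattened view does not show it. The sample inputs are invented for illustration.

```python
import re

_BOLD_KV_RE = re.compile(r"^\*\*[\w +/-]+?:\*\*")


def strip_md_frontmatter(text: str) -> str:
    lines = text.splitlines()
    n = len(lines)
    i = 0
    while i < n and not lines[i].strip():      # skip leading blank lines
        i += 1
    heading = None
    if i < n and lines[i].startswith("# "):    # optional H1, preserved
        heading = lines[i]
        i += 1
        while i < n and not lines[i].strip():
            i += 1
    if i >= n:
        return text
    first = lines[i].strip()
    if heading is None and first == "---":     # YAML-style block
        j = i + 1
        while j < n and lines[j].strip() != "---":
            j += 1
        if j >= n:
            return text                        # unterminated: leave as-is
        body_start = j + 1
    elif _BOLD_KV_RE.match(first):             # capture-style block
        j = i
        while j < n:
            s = lines[j].strip()
            if not s or _BOLD_KV_RE.match(s):
                j += 1
                continue
            if s == "---":
                body_start = j + 1
                break
            return text                        # non-frontmatter line: abort
        else:
            return text                        # no terminating '---'
    else:
        return text
    body = "\n".join(lines[body_start:]).lstrip("\n")
    return f"{heading}\n\n{body}" if heading else body


yaml_style = "---\ntitle: Note\n---\nBody text\n"
capture_style = "# Idea\n\n**date:** 2026-05-04\n**tags:** a, b\n\n---\n\nBody\n"
rule_after_h1 = "# Title\n\n---\n\nBody\n"

print(repr(strip_md_frontmatter(yaml_style)))
print(repr(strip_md_frontmatter(capture_style)))
print(strip_md_frontmatter(rule_after_h1) == rule_after_h1)
```

The third input shows the guard described in the docstring: a `---` that follows an H1 is treated as a horizontal rule, so the text comes back unchanged.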
|
||||||
|
|
||||||
|
|
||||||
def _docx_cell_paragraphs(cell):
|
def _docx_cell_paragraphs(cell):
|
||||||
yield from (p for p in cell.paragraphs if p.text.strip())
|
yield from (p for p in cell.paragraphs if p.text.strip())
|
||||||
@@ -49,12 +108,15 @@ def _pptx_shape_text(shape):
|
|||||||
return parts
|
return parts
|
||||||
|
|
||||||
|
|
||||||
def extract_text(filepath: Path) -> str:
|
def _extract_docx_blocks(filepath: Path) -> list[dict]:
|
||||||
"""Return the text of a supported file. Returns "" on any failure or
|
"""Return docx content as a single block. Earlier attempt at section-aware
|
||||||
unsupported extension. Does not write to ingest_failures — caller decides."""
|
chunking via Heading styles was rolled back: the user's docs are mostly
|
||||||
suffix = filepath.suffix.lower()
|
Normal-styled with bold-as-heading, and tying chunk boundaries to formatting
|
||||||
try:
|
choices locks future-them into preserving those choices forever. Lexical
|
||||||
if suffix == ".docx":
|
+ cross-encoder retrieval already finds the right substrings within a
|
||||||
|
blind-chunked CV, so the section structure isn't load-bearing for retrieval."""
|
||||||
|
from docx.oxml.ns import qn
|
||||||
|
|
||||||
doc = DocxDocument(filepath)
|
doc = DocxDocument(filepath)
|
||||||
parts = [p.text for p in doc.paragraphs if p.text.strip()]
|
parts = [p.text for p in doc.paragraphs if p.text.strip()]
|
||||||
for tbl in doc.tables:
|
for tbl in doc.tables:
|
||||||
@@ -64,35 +126,88 @@ def extract_text(filepath: Path) -> str:
|
|||||||
for section in doc.sections:
|
for section in doc.sections:
|
||||||
parts.extend(p.text for p in section.header.paragraphs if p.text.strip())
|
parts.extend(p.text for p in section.header.paragraphs if p.text.strip())
|
||||||
parts.extend(p.text for p in section.footer.paragraphs if p.text.strip())
|
parts.extend(p.text for p in section.footer.paragraphs if p.text.strip())
|
||||||
from docx.oxml.ns import qn
|
|
||||||
for txbx in doc.element.body.findall(".//" + qn("w:txbxContent")):
|
for txbx in doc.element.body.findall(".//" + qn("w:txbxContent")):
|
||||||
for p in txbx.findall(".//" + qn("w:p")):
|
for p in txbx.findall(".//" + qn("w:p")):
|
||||||
text = "".join(t.text or "" for t in p.findall(".//" + qn("w:t")))
|
text = "".join(t.text or "" for t in p.findall(".//" + qn("w:t")))
|
||||||
if text.strip():
|
if text.strip():
|
||||||
parts.append(text)
|
parts.append(text)
|
||||||
return "\n".join(parts)
|
text = "\n".join(parts)
|
||||||
elif suffix == ".pdf":
|
return [{"heading": None, "text": text, "kind": "doc"}] if text.strip() else []
|
||||||
reader = PdfReader(filepath)
|
|
||||||
return "".join(
|
|
||||||
page.extract_text() + "\n"
|
def _extract_pptx_blocks(filepath: Path) -> list[dict]:
|
||||||
for page in reader.pages if page.extract_text()
|
"""One block per slide. Heading = slide title (or 'Slide N' fallback).
|
||||||
)
|
Body = non-title shape text + speaker notes."""
|
||||||
elif suffix == ".pptx":
|
|
||||||
prs = Presentation(filepath)
|
prs = Presentation(filepath)
|
||||||
parts = []
|
blocks = []
|
||||||
for slide in prs.slides:
|
for i, slide in enumerate(prs.slides, 1):
|
||||||
|
title_shape = None
|
||||||
|
try:
|
||||||
|
title_shape = slide.shapes.title
|
||||||
|
except (AttributeError, KeyError):
|
||||||
|
pass
|
||||||
|
title = None
|
||||||
|
body_parts = []
|
||||||
for shape in slide.shapes:
|
for shape in slide.shapes:
|
||||||
parts.extend(_pptx_shape_text(shape))
|
if title_shape is not None and shape == title_shape and shape.has_text_frame:
|
||||||
|
title = shape.text_frame.text.strip() or None
|
||||||
|
continue
|
||||||
|
body_parts.extend(_pptx_shape_text(shape))
|
||||||
if slide.has_notes_slide:
|
if slide.has_notes_slide:
|
||||||
notes = slide.notes_slide.notes_text_frame.text
|
notes = slide.notes_slide.notes_text_frame.text
|
||||||
if notes.strip():
|
if notes.strip():
|
||||||
parts.append(notes)
|
body_parts.append(f"[Notes] {notes}")
|
||||||
return "\n".join(parts)
|
if title or body_parts:
|
||||||
elif suffix in {".txt", ".md"}:
|
blocks.append({
|
||||||
return filepath.read_text(encoding="utf-8", errors="ignore")
|
"heading": title or f"Slide {i}",
|
||||||
|
"text": "\n".join(body_parts),
|
||||||
|
"kind": "slide",
|
||||||
|
})
|
||||||
|
return blocks
|
||||||
|
|
||||||
|
|
||||||
|
def extract_blocks(filepath: Path) -> list[dict]:
|
||||||
|
"""Structured extraction. Returns list of {heading, text, kind} blocks.
|
||||||
|
|
||||||
|
- docx: single block, no heading (kind='doc'); Heading-style sectioning was rolled back.
|
||||||
|
- pptx: one block per slide (kind='slide').
|
||||||
|
- pdf/txt/md: single block, no heading (kind='doc').
|
||||||
|
|
||||||
|
Empty list on any failure or unsupported extension."""
|
||||||
|
suffix = filepath.suffix.lower()
|
||||||
|
try:
|
||||||
|
if suffix == ".docx":
|
||||||
|
return _extract_docx_blocks(filepath)
|
||||||
|
if suffix == ".pptx":
|
||||||
|
return _extract_pptx_blocks(filepath)
|
||||||
|
if suffix == ".pdf":
|
||||||
|
reader = PdfReader(filepath)
|
||||||
|
text = "".join(
|
||||||
|
page.extract_text() + "\n"
|
||||||
|
for page in reader.pages if page.extract_text()
|
||||||
|
)
|
||||||
|
return [{"heading": None, "text": text, "kind": "doc"}] if text.strip() else []
|
||||||
|
if suffix in {".txt", ".md"}:
|
||||||
|
text = filepath.read_text(encoding="utf-8", errors="ignore")
|
||||||
|
if suffix == ".md":
|
||||||
|
text = _strip_md_frontmatter(text)
|
||||||
|
return [{"heading": None, "text": text, "kind": "doc"}] if text.strip() else []
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
log.warning(f"Text extraction failed for {filepath.name}: {e}")
|
log.warning(f"Extraction failed for {filepath.name}: {e}")
|
||||||
return ""
|
return []
|
||||||
|
|
||||||
|
|
||||||
|
def extract_text(filepath: Path) -> str:
|
||||||
|
"""Back-compat wrapper: concatenate extract_blocks() output. Section
|
||||||
|
structure is lost; use extract_blocks() directly for chunking."""
|
||||||
|
blocks = extract_blocks(filepath)
|
||||||
|
parts = []
|
||||||
|
for b in blocks:
|
||||||
|
if b.get("heading"):
|
||||||
|
parts.append(b["heading"])
|
||||||
|
if b.get("text"):
|
||||||
|
parts.append(b["text"])
|
||||||
|
return "\n".join(parts)
|
||||||
|
|
||||||
|
|
||||||
def chunk_text(text: str,
|
def chunk_text(text: str,
|
||||||
@@ -115,18 +230,49 @@ def _chunk_id(filepath, source: str, index: int) -> str:
|
|||||||
return f"{hashlib.md5(basis.encode()).hexdigest()[:8]}_{index}"
|
return f"{hashlib.md5(basis.encode()).hexdigest()[:8]}_{index}"
|
||||||
|
|
||||||
|
|
||||||
def chunk_and_embed(text: str,
|
def chunk_and_embed(text_or_blocks,
|
||||||
source: str,
|
source: str,
|
||||||
embedder,
|
embedder,
|
||||||
filepath=None,
|
filepath=None,
|
||||||
folder=None) -> list[dict]:
|
folder=None) -> list[dict]:
|
||||||
"""Chunk text, embed each chunk, return rows ready for write_embeddings_batch."""
|
"""Chunk + embed for write_embeddings_batch. Accepts either:
|
||||||
chunks = chunk_text(text)
|
|
||||||
|
- str: blind chunking with 500-word windows (pdf/txt/md legacy path).
|
||||||
|
- list[dict]: block path (pptx slides; docx currently arrives as one
unheaded block). Each block emits one chunk if its text fits within
|
||||||
|
DEFAULT_CHUNK_SIZE words, otherwise is blind-split with overlap.
|
||||||
|
|
||||||
|
The block heading is prepended to the chunk text (so retrieval sees the
|
||||||
|
section context) and stored in metadata as heading/kind."""
|
||||||
|
if isinstance(text_or_blocks, str):
|
||||||
|
blocks = [{"heading": None, "text": text_or_blocks, "kind": "doc"}]
|
||||||
|
else:
|
||||||
|
blocks = text_or_blocks
|
||||||
|
|
||||||
|
chunks = []
|
||||||
|
for block in blocks:
|
||||||
|
body = block.get("text") or ""
|
||||||
|
heading = block.get("heading")
|
||||||
|
kind = block.get("kind", "doc")
|
||||||
|
if not body.strip() and not (heading and heading.strip()):
|
||||||
|
continue
|
||||||
|
if heading and body.strip():
|
||||||
|
contextualized = f"{heading}\n\n{body}"
|
||||||
|
elif heading:
|
||||||
|
contextualized = heading
|
||||||
|
else:
|
||||||
|
contextualized = body
|
||||||
|
if len(contextualized.split()) <= DEFAULT_CHUNK_SIZE:
|
||||||
|
chunks.append((contextualized, heading, kind))
|
||||||
|
else:
|
||||||
|
for sub in chunk_text(contextualized):
|
||||||
|
chunks.append((sub, heading, kind))
|
||||||
|
|
||||||
if not chunks:
|
if not chunks:
|
||||||
return []
|
return []
|
||||||
embeddings = embedder.encode(chunks).tolist()
|
embeddings = embedder.encode([c[0] for c in chunks]).tolist()
|
||||||
rows = []
|
rows = []
|
||||||
for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
|
for i, ((chunk, heading, kind), emb) in enumerate(zip(chunks, embeddings)):
|
||||||
rows.append({
|
rows.append({
|
||||||
"id": _chunk_id(filepath, source, i),
|
"id": _chunk_id(filepath, source, i),
|
||||||
"document": chunk,
|
"document": chunk,
|
||||||
@@ -137,13 +283,15 @@ def chunk_and_embed(text: str,
|
|||||||
"source": source,
|
"source": source,
|
||||||
"filepath": str(filepath) if filepath else source,
|
"filepath": str(filepath) if filepath else source,
|
||||||
"folder": folder,
|
"folder": folder,
|
||||||
|
"heading": heading,
|
||||||
|
"kind": kind,
|
||||||
},
|
},
|
||||||
})
|
})
|
||||||
return rows
|
return rows
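The block-to-chunk step in `chunk_and_embed` can be tried without an embedder. The sketch below is a simplified illustration under stated assumptions: `chunk_words` is a hypothetical stand-in for `chunk_text` (whose body is not shown in this diff) that steps forward by `chunk_size - overlap` words, and `blocks_to_chunks` strips the body text, which the original does not do. Small window sizes are used so the behaviour is visible.

```python
def chunk_words(text, chunk_size=500, overlap=50):
    """Hypothetical word-window splitter: fixed-size windows with overlap."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end
    return chunks


def blocks_to_chunks(blocks, chunk_size=500, overlap=50):
    """Turn {heading, text, kind} blocks into (chunk, heading, kind) tuples."""
    out = []
    for block in blocks:
        body = (block.get("text") or "").strip()
        heading = (block.get("heading") or "").strip() or None
        kind = block.get("kind", "doc")
        if not body and not heading:
            continue  # nothing to index
        if heading and body:
            contextualized = f"{heading}\n\n{body}"
        else:
            contextualized = heading or body
        if len(contextualized.split()) <= chunk_size:
            out.append((contextualized, heading, kind))
        else:
            for sub in chunk_words(contextualized, chunk_size, overlap):
                out.append((sub, heading, kind))
    return out


blocks = [
    {"heading": "Slide 1", "text": "a b c", "kind": "slide"},
    {"heading": None, "text": " ".join(str(i) for i in range(12)), "kind": "doc"},
    {"heading": None, "text": "   ", "kind": "doc"},
]
for chunk, heading, kind in blocks_to_chunks(blocks, chunk_size=5, overlap=1):
    print(kind, heading, repr(chunk))
```

One consequence of prepending the heading before splitting: when a block is too long and gets split, only the first sub-chunk carries the heading in its text, while every sub-chunk keeps it in metadata.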
|
||||||
|
|
||||||
|
|
||||||
def write_embeddings_batch(conn, batch: list[dict]) -> int:
|
def write_embeddings_batch(conn, batch: list[dict], commit: bool = True) -> int:
|
||||||
"""Single canonical INSERT. Sets created_at = NOW() server-side. Commits.
|
"""Single canonical INSERT. Sets created_at = NOW() server-side.
|
||||||
|
|
||||||
Every row dict must supply 'type'. created_at is SQL-supplied (NOW()), so
|
Every row dict must supply 'type'. created_at is SQL-supplied (NOW()), so
|
||||||
callers do not need to provide it. The application-layer assertion is the
|
callers do not need to provide it. The application-layer assertion is the
|
||||||
@@ -151,6 +299,11 @@ def write_embeddings_batch(conn, batch: list[dict]) -> int:
|
|||||||
historical NULLs were resolved by the Improvement #2 backfill, and a
|
historical NULLs were resolved by the Improvement #2 backfill, and a
|
||||||
Python-level raise gives a faster, more debuggable failure than a
|
Python-level raise gives a faster, more debuggable failure than a
|
||||||
Postgres constraint error.
|
Postgres constraint error.
|
||||||
|
|
||||||
|
When commit=True (default), this function commits the connection itself.
|
||||||
|
When commit=False, the caller is responsible for committing. Use
|
||||||
|
commit=False when composing this write with other writes that must land
|
||||||
|
atomically in the same transaction.
|
||||||
"""
|
"""
|
||||||
if not batch:
|
if not batch:
|
||||||
return 0
|
return 0
|
||||||
@@ -173,5 +326,6 @@ def write_embeddings_batch(conn, batch: list[dict]) -> int:
|
|||||||
metadata = EXCLUDED.metadata
|
metadata = EXCLUDED.metadata
|
||||||
""", (row["id"], row["document"], row["embedding"],
|
""", (row["id"], row["document"], row["embedding"],
|
||||||
row["source"], row["type"], json.dumps(row["metadata"])))
|
row["source"], row["type"], json.dumps(row["metadata"])))
|
||||||
|
if commit:
|
||||||
conn.commit()
|
conn.commit()
|
||||||
return len(batch)
|
return len(batch)
|
||||||
|
|||||||
@@ -75,6 +75,17 @@ async def lifespan(app: FastAPI):
|
|||||||
max_coroutines=2,
|
max_coroutines=2,
|
||||||
)
|
)
|
||||||
await graphiti_instance.build_indices_and_constraints()
|
await graphiti_instance.build_indices_and_constraints()
|
||||||
|
# Bridge driver._search_ops to driver.search_interface — graphiti-core 0.29.0
|
||||||
|
# builds FalkorSearchOperations as driver._search_ops in FalkorDriver.__init__
|
||||||
|
# but never assigns it to driver.search_interface. search_utils.py dispatches
|
||||||
|
# on driver.search_interface; without this assignment it falls back to
|
||||||
|
# interpreted-Cypher cosine math (full table scans). Together with the
|
||||||
|
# vendored patches in graphiti_patches/, this activates FalkorDB's native
|
||||||
|
# vector index for entity dedup similarity search.
|
||||||
|
if (hasattr(graphiti_instance.driver, "_search_ops")
|
||||||
|
and graphiti_instance.driver.search_interface is None):
|
||||||
|
graphiti_instance.driver.search_interface = graphiti_instance.driver._search_ops
|
||||||
|
log.info("Wired driver.search_interface = driver._search_ops (vector index path active)")
|
||||||
log.info(f"Graphiti ready — provider: {LLM_PROVIDER}, group: {GROUP_ID}")
|
log.info(f"Graphiti ready — provider: {LLM_PROVIDER}, group: {GROUP_ID}")
|
||||||
yield
|
yield
|
||||||
await graphiti_instance.close()
|
await graphiti_instance.close()
|
||||||
|
|||||||
+25
-6
@@ -15,7 +15,7 @@ from dotenv import load_dotenv
|
|||||||
import psycopg2
|
import psycopg2
|
||||||
from sentence_transformers import SentenceTransformer
|
from sentence_transformers import SentenceTransformer
|
||||||
|
|
||||||
from encoding import extract_text, chunk_and_embed, write_embeddings_batch, SUPPORTED
|
from encoding import extract_blocks, chunk_and_embed, write_embeddings_batch, SUPPORTED
|
||||||
from failures import (
|
from failures import (
|
||||||
record_ingest_failure as _record_failure_sql,
|
record_ingest_failure as _record_failure_sql,
|
||||||
resolve_ingest_failure as _resolve_failure_sql,
|
resolve_ingest_failure as _resolve_failure_sql,
|
||||||
@@ -77,14 +77,29 @@ def _resolve_failure(source: str) -> None:
|
|||||||
print(f" Could not resolve ingest failure record (non-fatal): {e}")
|
print(f" Could not resolve ingest failure record (non-fatal): {e}")
|
||||||
|
|
||||||
|
|
||||||
|
IGNORED_TOP_FOLDERS = {"Drafts"}
|
||||||
|
|
||||||
|
|
||||||
def _ingest_one(filepath: Path, embedder, root: Path = None) -> int:
|
def _ingest_one(filepath: Path, embedder, root: Path = None) -> int:
|
||||||
"""Ingest a single file. Returns chunk count, 0 on skip/failure."""
|
"""Ingest a single file. Returns chunk count, 0 on skip/failure."""
|
||||||
if filepath.name.startswith(("~$", ".")):
|
# "~" catches Office lock files (~$) including the case where Nextcloud
|
||||||
|
# filesystem encoding has mangled the "$" to a unicode replacement char.
|
||||||
|
if filepath.name.startswith(("~", ".")):
|
||||||
return 0
|
return 0
|
||||||
if filepath.suffix.lower() not in SUPPORTED:
|
if filepath.suffix.lower() not in SUPPORTED:
|
||||||
return 0
|
return 0
|
||||||
text = extract_text(filepath)
|
if root is not None:
|
||||||
if not text.strip():
|
try:
|
||||||
|
rel = filepath.parent.relative_to(root)
|
||||||
|
if rel.parts and rel.parts[0] in IGNORED_TOP_FOLDERS:
|
||||||
|
return 0
|
||||||
|
except ValueError:
|
||||||
|
pass
|
||||||
|
blocks = extract_blocks(filepath)
|
||||||
|
if not blocks or not any(
|
||||||
|
(b.get("text") or "").strip() or (b.get("heading") or "").strip()
|
||||||
|
for b in blocks
|
||||||
|
):
|
||||||
_record_failure(filepath, "Text extraction failed or empty")
|
_record_failure(filepath, "Text extraction failed or empty")
|
||||||
return 0
|
return 0
|
||||||
folder_rel = None
|
folder_rel = None
|
||||||
@@ -94,7 +109,7 @@ def _ingest_one(filepath: Path, embedder, root: Path = None) -> int:
|
|||||||
except ValueError:
|
except ValueError:
|
||||||
pass
|
pass
|
||||||
try:
|
try:
|
||||||
rows = chunk_and_embed(text, filepath.name, embedder,
|
rows = chunk_and_embed(blocks, filepath.name, embedder,
|
||||||
filepath=filepath, folder=folder_rel)
|
filepath=filepath, folder=folder_rel)
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
_record_failure(filepath, f"Embedding failed: {e}")
|
_record_failure(filepath, f"Embedding failed: {e}")
|
||||||
@@ -113,7 +128,11 @@ def _ingest_one(filepath: Path, embedder, root: Path = None) -> int:
|
|||||||
print(f" Indexed {len(rows)} chunks: {filepath.name}")
|
print(f" Indexed {len(rows)} chunks: {filepath.name}")
|
||||||
_resolve_failure(filepath.name)
|
_resolve_failure(filepath.name)
|
||||||
if not os.getenv("SKIP_STAGE2_ENQUEUE"):
|
if not os.getenv("SKIP_STAGE2_ENQUEUE"):
|
||||||
enqueue_stage2(filepath.name, text)
|
full_text = "\n".join(
|
||||||
|
f"{b['heading']}\n{b['text']}" if b.get("heading") else b.get("text", "")
|
||||||
|
for b in blocks
|
||||||
|
)
|
||||||
|
enqueue_stage2(filepath.name, full_text)
|
||||||
return len(rows)
|
return len(rows)
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -0,0 +1,136 @@
|
|||||||
|
"""
|
||||||
|
Orientation Indexer — feeds Stage 2's document-level orientations into pgvector
|
||||||
|
so they're searchable alongside chunk text by the retrieve_documents tool.
|
||||||
|
|
||||||
|
Each completed row in stage_3_queue has an `orientation` string (active_frames
|
||||||
|
+ frame_relationships + extraction_orientation + one_sentence_summary) that
|
||||||
|
describes the document at a conceptual level. Indexing it as its own row in
|
||||||
|
the embeddings table gives the cross-encoder a second surface to rank against
|
||||||
|
— "what is this document about" rather than just "what does this chunk say."
|
||||||
|
|
||||||
|
This worker is part of the "read-only Graphiti + orientation-into-pgvector"
|
||||||
|
plan B that replaced the Stage 3 → Graphiti write path. The graph layer is
|
||||||
|
queried directly via the search_facts chat tool; orientations land here.
|
||||||
|
|
||||||
|
State tracking: a row is considered indexed if the embeddings table already
|
||||||
|
holds a row with source=<source> and metadata->>'kind'='orientation'. The
|
||||||
|
worker is idempotent — restart-safe, resumable.
|
||||||
|
|
||||||
|
Runs as systemd: aaronai-orientation-indexer.service
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from dotenv import load_dotenv
|
||||||
|
import psycopg2
|
||||||
|
from sentence_transformers import SentenceTransformer
|
||||||
|
|
||||||
|
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
|
||||||
|
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent))
|
||||||
|
from encoding import write_embeddings_batch
|
||||||
|
|
||||||
|
PG_DSN = os.getenv("PG_DSN")
|
||||||
|
EMBED_MODEL = "all-MiniLM-L6-v2"
|
||||||
|
BATCH_SIZE = 25
|
||||||
|
POLL_INTERVAL_SECS = 30
|
||||||
|
LOG_FILE = "/var/log/aaronai/orientation-indexer.log"
|
||||||
|
HEARTBEAT_FILE = "/var/log/aaronai/orientation-indexer-heartbeat"
|
||||||
|
|
||||||
|
logging.basicConfig(
|
||||||
|
level=logging.INFO,
|
||||||
|
format="%(asctime)s [orientation-indexer] %(levelname)s %(message)s",
|
||||||
|
handlers=[logging.FileHandler(LOG_FILE, mode="a")],
|
||||||
|
)
|
||||||
|
log = logging.getLogger("orientation-indexer")
|
||||||
|
|
||||||
|
|
||||||
|
def get_pg():
|
||||||
|
return psycopg2.connect(PG_DSN)
|
||||||
|
|
||||||
|
|
||||||
|
def fetch_unindexed(cur, limit):
|
||||||
|
"""Pull stage_3_queue rows with a non-null orientation whose orientation
|
||||||
|
hasn't been written to the embeddings table yet."""
|
||||||
|
cur.execute(
|
||||||
|
"""
|
||||||
|
SELECT s.source, s.orientation
|
||||||
|
FROM stage_3_queue s
|
||||||
|
WHERE s.orientation IS NOT NULL
|
||||||
|
AND NOT EXISTS (
|
||||||
|
SELECT 1 FROM embeddings e
|
||||||
|
WHERE e.source = s.source
|
||||||
|
AND e.metadata->>'kind' = 'orientation'
|
||||||
|
)
|
||||||
|
ORDER BY s.enqueued_at
|
||||||
|
LIMIT %s
|
||||||
|
""",
|
||||||
|
(limit,),
|
||||||
|
)
|
||||||
|
return cur.fetchall()
|
||||||
|
|
||||||
|
|
||||||
|
def _row_for(source: str, orientation: str, embedding) -> dict:
|
||||||
|
"""Build an embeddings row for the orientation. id is deterministic so
|
||||||
|
re-runs don't create duplicates if the unique check above ever races."""
|
||||||
|
import hashlib
|
||||||
|
chunk_id = hashlib.md5(f"orientation:{source}".encode()).hexdigest()[:8] + "_orient"
|
||||||
|
return {
|
||||||
|
"id": chunk_id,
|
||||||
|
"document": orientation,
|
||||||
|
"embedding": embedding,
|
||||||
|
"source": source,
|
||||||
|
"type": "document",
|
||||||
|
"metadata": {
|
||||||
|
"source": source,
|
||||||
|
"kind": "orientation",
|
||||||
|
},
|
||||||
|
}
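The deterministic id is what lets a repeated write land on the same row instead of creating a duplicate. The sketch below isolates that idea. The helper name `orientation_row_id`, the file name, and the in-memory `table` dict are hypothetical stand-ins; the real write goes through `write_embeddings_batch`, whose visible `EXCLUDED.metadata` fragment suggests an upsert.

```python
import hashlib


def orientation_row_id(source: str) -> str:
    """Same source always yields the same id: 8 hex chars plus a suffix."""
    digest = hashlib.md5(f"orientation:{source}".encode()).hexdigest()
    return digest[:8] + "_orient"


# Hypothetical in-memory stand-in for the embeddings table, keyed by id.
table = {}
for attempt in range(2):  # simulate the worker indexing the same source twice
    row_id = orientation_row_id("cv_2026.docx")
    table[row_id] = {"source": "cv_2026.docx", "kind": "orientation"}

print(len(table), row_id)
```

Truncating the digest to 8 hex characters leaves 32 bits of id space, so two different sources could in principle collide. For a personal corpus that risk is small, but it is worth knowing that a collision would overwrite rather than fail.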
|
||||||
|
|
||||||
|
|
||||||
|
def write_heartbeat():
|
||||||
|
try:
|
||||||
|
Path(HEARTBEAT_FILE).write_text(str(time.time()))
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
log.info("Orientation indexer starting...")
|
||||||
|
log.info(f"Loading embedding model: {EMBED_MODEL}")
|
||||||
|
embedder = SentenceTransformer(EMBED_MODEL)
|
||||||
|
log.info("Embedding model ready.")
|
||||||
|
|
||||||
|
while True:
|
||||||
|
write_heartbeat()
|
||||||
|
try:
|
||||||
|
pg = get_pg()
|
||||||
|
try:
|
||||||
|
cur = pg.cursor()
|
||||||
|
rows = fetch_unindexed(cur, BATCH_SIZE)
|
||||||
|
if not rows:
|
||||||
|
pg.close()
|
||||||
|
time.sleep(POLL_INTERVAL_SECS)
|
||||||
|
continue
|
||||||
|
|
||||||
|
orientations = [r[1] for r in rows]
|
||||||
|
embeddings = embedder.encode(orientations).tolist()
|
||||||
|
batch = [
|
||||||
|
_row_for(source, orient, emb)
|
||||||
|
for (source, orient), emb in zip(rows, embeddings)
|
||||||
|
]
|
||||||
|
write_embeddings_batch(pg, batch)
|
||||||
|
log.info(f"Indexed {len(batch)} orientation(s)")
|
||||||
|
finally:
|
||||||
|
pg.close()
|
||||||
|
except Exception as e:
|
||||||
|
log.error(f"Indexing loop iteration failed: {e}")
|
||||||
|
time.sleep(POLL_INTERVAL_SECS)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
@@ -0,0 +1,146 @@
|
|||||||
|
"""One-off: re-ingest docx+pptx after the 2026-05-04 extractor upgrade (commit 93c0d89).
|
||||||
|
|
||||||
|
Pre-upgrade extraction missed tables, headers/footers, text boxes, group shapes,
|
||||||
|
and pptx notes — leaving CVs/dossiers as section-header skeletons in the index.
|
||||||
|
|
||||||
|
Steps when run with --apply:
|
||||||
|
1. DELETE all embeddings rows where source ends in .docx or .pptx
|
||||||
|
2. Walk NEXTCLOUD_PATH and re-ingest every .docx/.pptx via _ingest_one
|
||||||
|
3. Stage 2 enqueue is suppressed (SKIP_STAGE2_ENQUEUE=1)
|
||||||
|
|
||||||
|
Without --apply: dry-run. Counts files and chunks, prints a sample, writes nothing.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import os
|
||||||
|
import re
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
os.environ["SKIP_STAGE2_ENQUEUE"] = "1"
|
||||||
|
|
||||||
|
from dotenv import load_dotenv
|
||||||
|
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
|
||||||
|
|
||||||
|
import psycopg2
|
||||||
|
from sentence_transformers import SentenceTransformer
|
||||||
|
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent))
|
||||||
|
from ingest import _ingest_one, get_pg
|
||||||
|
|
||||||
|
NEXTCLOUD_PATH = Path("/home/aaron/nextcloud/data/data/aaron/files")
|
||||||
|
|
||||||
|
APPLY = "--apply" in sys.argv
|
||||||
|
_ext_args = [a for a in sys.argv[1:] if a.startswith("--ext=")]
|
||||||
|
if _ext_args:
|
||||||
|
TARGET_EXTS = {("." + e.lstrip(".")) for arg in _ext_args
|
||||||
|
for e in arg.split("=", 1)[1].split(",")}
|
||||||
|
else:
|
||||||
|
TARGET_EXTS = {".docx", ".pptx"}
|
||||||
|
|
||||||
|
|
||||||
|
def _ext_regex():
|
||||||
|
inner = "|".join(re.escape(e.lstrip(".")) for e in sorted(TARGET_EXTS))
|
||||||
|
return f"\\.({inner})$"
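The `--ext=` parsing and the regex it feeds can be tried in isolation. This is a sketch under assumptions, not the script's code: the function names are hypothetical, and the lowercasing and empty-entry filtering are additions that the original does not perform. Python's `re` is used here only to check the pattern; in the script it is evaluated by PostgreSQL's `~` operator.

```python
import re


def parse_target_exts(argv, default=(".docx", ".pptx")):
    """Collect extensions from --ext=a,b arguments, normalised to '.ext'."""
    ext_args = [a for a in argv if a.startswith("--ext=")]
    if not ext_args:
        return set(default)
    return {
        "." + e.strip().lstrip(".").lower()
        for arg in ext_args
        for e in arg.split("=", 1)[1].split(",")
        if e.strip()  # ignore empty entries such as a trailing comma
    }


def ext_regex(exts):
    """Build an end-anchored pattern matching any of the given extensions."""
    inner = "|".join(re.escape(e.lstrip(".")) for e in sorted(exts))
    return rf"\.({inner})$"


exts = parse_target_exts(["--apply", "--ext=pdf,.MD"])
pattern = ext_regex({".docx", ".pptx"})
print(sorted(exts), pattern)
print(bool(re.search(pattern, "cv.docx")), bool(re.search(pattern, "notes.docx.bak")))
```

Since the pattern is built from command-line input, one option is to pass it as a bound parameter, for example `cur.execute("DELETE FROM embeddings WHERE lower(source) ~ %s", (pattern,))`, rather than interpolating it into the SQL string. `re.escape` already limits what can reach the query, so this is a hardening suggestion rather than a fix for a known problem.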
|
||||||
|
|
||||||
|
|
||||||
|
def count_stale():
|
||||||
|
pg = get_pg()
|
||||||
|
cur = pg.cursor()
|
||||||
|
cur.execute(
|
||||||
|
f"SELECT lower(substring(source from '\\.[^.]+$')) AS ext, "
|
||||||
|
f"COUNT(DISTINCT source) AS files, COUNT(*) AS chunks "
|
||||||
|
f"FROM embeddings WHERE lower(source) ~ '{_ext_regex()}' "
|
||||||
|
f"GROUP BY 1 ORDER BY 1"
|
||||||
|
)
|
||||||
|
rows = cur.fetchall()
|
||||||
|
pg.close()
|
||||||
|
return rows
|
||||||
|
|
||||||
|
|
||||||
|
def delete_stale():
|
||||||
|
pg = get_pg()
|
||||||
|
cur = pg.cursor()
|
||||||
|
cur.execute(f"DELETE FROM embeddings WHERE lower(source) ~ '{_ext_regex()}'")
|
||||||
|
deleted = cur.rowcount
|
||||||
|
pg.commit()
|
||||||
|
pg.close()
|
||||||
|
return deleted
|
||||||
|
|
||||||
|
|
||||||
|
def find_files():
|
||||||
|
files = []
|
||||||
|
for f in NEXTCLOUD_PATH.rglob("*"):
|
||||||
|
if not f.is_file():
|
||||||
|
continue
|
||||||
|
if f.suffix.lower() not in TARGET_EXTS:
|
||||||
|
continue
|
||||||
|
if f.name.startswith(("~", ".")):  # same lock-file filter as ingest._ingest_one
|
||||||
|
continue
|
||||||
|
files.append(f)
|
||||||
|
return files
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
print(f"Mode: {'APPLY (destructive)' if APPLY else 'DRY-RUN (no writes)'}")
|
||||||
|
print(f"Target: {NEXTCLOUD_PATH}")
|
||||||
|
print(f"Extensions: {sorted(TARGET_EXTS)}")
|
||||||
|
print(f"SKIP_STAGE2_ENQUEUE={os.environ.get('SKIP_STAGE2_ENQUEUE')}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
print("Stale chunks currently in DB:")
|
||||||
|
for ext, files, chunks in count_stale():
|
||||||
|
print(f" {ext}: {files} files, {chunks} chunks")
|
||||||
|
print()
|
||||||
|
|
||||||
|
files = find_files()
|
||||||
|
by_ext = {}
|
||||||
|
for f in files:
|
||||||
|
by_ext.setdefault(f.suffix.lower(), []).append(f)
|
||||||
|
print(f"Files on disk to re-ingest:")
|
||||||
|
for ext, lst in sorted(by_ext.items()):
|
||||||
|
print(f" {ext}: {len(lst)} files")
|
||||||
|
print(f" total: {len(files)}")
|
||||||
|
print()
|
||||||
|
print("Sample (5 random):")
|
||||||
|
import random
|
||||||
|
for f in random.sample(files, min(5, len(files))):
|
||||||
|
print(f" {f}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
if not APPLY:
|
||||||
|
print("Dry-run only. Re-run with --apply to delete + re-ingest.")
|
||||||
|
return
|
||||||
|
|
||||||
|
print("Deleting stale chunks...")
|
||||||
|
n = delete_stale()
|
||||||
|
print(f" deleted {n} rows")
|
||||||
|
print()
|
||||||
|
|
||||||
|
print("Loading embedder...")
|
||||||
|
embedder = SentenceTransformer("all-MiniLM-L6-v2")
|
||||||
|
print()
|
||||||
|
|
||||||
|
print(f"Re-ingesting {len(files)} files...")
|
||||||
|
started = time.time()
|
||||||
|
ingested = failed = total_chunks = 0
|
||||||
|
for i, f in enumerate(files, 1):
|
||||||
|
n = _ingest_one(f, embedder, root=NEXTCLOUD_PATH)
|
||||||
|
if n > 0:
|
||||||
|
ingested += 1
|
||||||
|
total_chunks += n
|
||||||
|
else:
|
||||||
|
failed += 1
|
||||||
|
if i % 25 == 0 or i == len(files):
|
||||||
|
elapsed = time.time() - started
|
||||||
|
rate = i / elapsed if elapsed else 0
|
||||||
|
print(f" [{i}/{len(files)}] ingested={ingested} failed={failed} "
|
||||||
|
f"chunks={total_chunks} ({rate:.1f} files/s)")
|
||||||
|
elapsed = time.time() - started
|
||||||
|
print()
|
||||||
|
print(f"Done in {elapsed:.0f}s: {ingested} ingested, {failed} failed, "
|
||||||
|
f"{total_chunks} chunks written.")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
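Review note: `count_stale` and `delete_stale` splice the result of `_ext_regex()` into the SQL text with an f-string. That is harmless while `TARGET_EXTS` is a module constant, but the pattern could also travel as a bound parameter. A minimal sketch, assuming a DB-API cursor that uses `%s` placeholders as psycopg2 does; `build_ext_regex` and `delete_by_extension` are hypothetical names, not part of the script:

```python
import re


def build_ext_regex(exts):
    """Return a regex that matches any of the given extensions at end of string."""
    inner = "|".join(re.escape(e.lstrip(".")) for e in sorted(exts))
    return rf"\.({inner})$"


def delete_by_extension(cur, exts):
    """Delete rows whose lowercased source ends in one of `exts`.

    `cur` is assumed to be a DB-API cursor with %s placeholders; the
    pattern is passed as a parameter instead of being formatted into SQL.
    """
    cur.execute(
        "DELETE FROM embeddings WHERE lower(source) ~ %s",
        (build_ext_regex(exts),),
    )
    return cur.rowcount
```

The regex itself is unchanged from the script; only the way it reaches the database differs.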
@@ -0,0 +1,123 @@
+"""One-off: remove embeddings rows that no longer correspond to a file on disk.
+
+Two passes:
+  1. Modern rows (metadata.filepath set): check each filepath, delete if missing.
+  2. Legacy rows (metadata.filepath null): build a set of all basenames present
+     anywhere under NEXTCLOUD_PATH, then delete rows whose `source` basename
+     isn't in that set.
+
+Default mode is a dry-run (counts + sample paths, no writes). Pass --apply to
+actually delete.
+"""
+
+import os
+import sys
+from pathlib import Path
+from collections import defaultdict
+
+from dotenv import load_dotenv
+load_dotenv(Path.home() / "aaronai" / ".env", override=True)
+
+import psycopg2
+
+NEXTCLOUD_PATH = Path("/home/aaron/nextcloud/data/data/aaron/files")
+APPLY = "--apply" in sys.argv
+
+
+def get_pg():
+    return psycopg2.connect(os.environ["PG_DSN"])
+
+
+def scan_modern_orphans():
+    """Rows with metadata.filepath whose file doesn't exist on disk."""
+    pg = get_pg()
+    cur = pg.cursor()
+    cur.execute(
+        "SELECT id, source, metadata->>'filepath' AS filepath "
+        "FROM embeddings WHERE metadata->>'filepath' IS NOT NULL"
+    )
+    orphans = []
+    by_source = defaultdict(int)
+    for row in cur.fetchall():
+        fp = row[2]
+        if fp and not Path(fp).exists():
+            orphans.append(row)
+            by_source[row[1]] += 1
+    pg.close()
+    return orphans, by_source
+
+
+def scan_legacy_orphans():
+    """Rows without metadata.filepath whose basename isn't anywhere under
+    NEXTCLOUD_PATH. Restricted to type='document' so conversations and memory
+    snapshots (which are synthetic sources, not files on disk) aren't flagged
+    as orphans. Walks the filesystem once to build the basename set."""
+    print(f" walking {NEXTCLOUD_PATH} to build basename index...")
+    on_disk = set()
+    for p in NEXTCLOUD_PATH.rglob("*"):
+        if p.is_file():
+            on_disk.add(p.name)
+    print(f" {len(on_disk):,} files on disk")
+
+    pg = get_pg()
+    cur = pg.cursor()
+    cur.execute(
+        "SELECT id, source FROM embeddings "
+        "WHERE metadata->>'filepath' IS NULL AND type = 'document'"
+    )
+    orphans = []
+    by_source = defaultdict(int)
+    for row in cur.fetchall():
+        if Path(row[1]).name not in on_disk:  # compare by basename, per the docstring
+            orphans.append(row)
+            by_source[row[1]] += 1
+    pg.close()
+    return orphans, by_source
+
+
+def delete_rows(ids):
+    pg = get_pg()
+    cur = pg.cursor()
+    cur.execute("DELETE FROM embeddings WHERE id = ANY(%s)", (list(ids),))
+    deleted = cur.rowcount
+    pg.commit()
+    pg.close()
+    return deleted
+
+
+def main():
+    print(f"Mode: {'APPLY (destructive)' if APPLY else 'DRY-RUN (no writes)'}")
+    print(f"Target: {NEXTCLOUD_PATH}")
+    print()
+
+    print("Pass 1 — modern rows (metadata.filepath set):")
+    modern, modern_by_src = scan_modern_orphans()
+    print(f" {len(modern):,} orphan rows across {len(modern_by_src):,} files")
+    for src, n in sorted(modern_by_src.items(), key=lambda kv: -kv[1])[:10]:
+        print(f" {n:>4} chunks — {src}")
+    print()
+
+    print("Pass 2 — legacy rows (no metadata.filepath):")
+    legacy, legacy_by_src = scan_legacy_orphans()
+    print(f" {len(legacy):,} orphan rows across {len(legacy_by_src):,} files")
+    for src, n in sorted(legacy_by_src.items(), key=lambda kv: -kv[1])[:10]:
+        print(f" {n:>4} chunks — {src}")
+    print()
+
+    total = len(modern) + len(legacy)
+    if total == 0:
+        print("Nothing to delete.")
+        return
+
+    if not APPLY:
+        print(f"Dry-run only. Re-run with --apply to delete {total:,} rows.")
+        return
+
+    print(f"Deleting {total:,} orphan rows...")
+    n1 = delete_rows([r[0] for r in modern]) if modern else 0
+    n2 = delete_rows([r[0] for r in legacy]) if legacy else 0
+    print(f" modern: {n1:,} legacy: {n2:,} total: {n1 + n2:,}")
+
+
+if __name__ == "__main__":
+    main()
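Review note: the legacy pass described in the module docstring can be exercised without a database. The sketch below walks a temporary directory instead of `NEXTCLOUD_PATH`; `basename_index` and `legacy_orphans` are hypothetical helper names, and `rows` is assumed to be a list of `(id, source)` tuples shaped like the pass-2 `SELECT` result:

```python
import tempfile
from pathlib import Path


def basename_index(root):
    """Collect the names of all regular files under root in one filesystem walk."""
    return {p.name for p in Path(root).rglob("*") if p.is_file()}


def legacy_orphans(rows, on_disk):
    """Return the (id, source) rows whose source basename is not in on_disk."""
    return [row for row in rows if Path(row[1]).name not in on_disk]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "sub").mkdir()
        (Path(tmp) / "sub" / "cv.docx").write_text("x")
        index = basename_index(tmp)
        rows = [(1, "cv.docx"), (2, "gone.pptx")]
        print(legacy_orphans(rows, index))
```

Matching on basename alone is coarse: a legacy row survives as long as any file with the same name exists anywhere under the root, even if the original was deleted from a different folder.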
@@ -0,0 +1,53 @@
+"""End-to-end test of retrieve_context with intent routing + reranking.
+
+Avoids loading the full FastAPI app; replicates the chat-handler retrieval
+call shape and prints classifier output + final ranked sources for each query.
+"""
+
+import os
+import sys
+from pathlib import Path
+
+from dotenv import load_dotenv
+load_dotenv(Path.home() / "aaronai" / ".env", override=True)
+
+sys.path.insert(0, str(Path(__file__).parent))
+
+# Stub anthropic so api.py import doesn't fail without the SDK loaded.
+# We only need retrieve_context.
+import types
+sys.modules.setdefault("anthropic", types.ModuleType("anthropic"))
+sys.modules["anthropic"].Anthropic = lambda **kw: None
+
+# Same for whisper if present
+if "faster_whisper" not in sys.modules:
+    sys.modules["faster_whisper"] = types.ModuleType("faster_whisper")
+
+import importlib.util
+spec = importlib.util.spec_from_file_location("api", Path(__file__).parent / "api.py")
+api = importlib.util.module_from_spec(spec)
+# Don't execute the whole module (it starts FastAPI). Instead, exec only definitions.
+# Easier: just import the functions we need by exec'ing the file but catching errors.
+try:
+    spec.loader.exec_module(api)
+except Exception as e:
+    print(f"(continuing despite api.py side-effect error: {e})")
+
+retrieve_context = api.retrieve_context
+
+QUERIES = [
+    "write me a bio",
+    "my professional bio",
+    "Aaron Nelson CV consulting and design work",
+    "FWN3D consulting",
+    "syllabi I have taught",
+    "philosophy of teaching",
+    "Hudson Valley Additive Manufacturing Center",
+    "Aaron Nelson is an artist and educator working in additive manufacturing",
+]
+
+for q in QUERIES:
+    pieces, sources = retrieve_context(q)
+    print(f"\n=== {q!r} ===")
+    for i, src in enumerate(sources, 1):
+        print(f" {i}. {src}")
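Review note: the test script stubs heavy dependencies by planting placeholder modules in `sys.modules` before `api.py` is loaded. The same idea in isolation, as a sketch only: `install_stub` and the module name `fake_sdk_for_demo` are made up for illustration. Unlike the script, which assigns `Anthropic` unconditionally, this variant only fills in attributes the module lacks, so an already-imported real SDK would be left untouched.

```python
import sys
import types


def install_stub(name, **attrs):
    """Register a placeholder module under `name` unless one is already loaded."""
    module = sys.modules.setdefault(name, types.ModuleType(name))
    for key, value in attrs.items():
        # Only add attributes that are missing; never overwrite existing ones.
        if not hasattr(module, key):
            setattr(module, key, value)
    return module


install_stub("fake_sdk_for_demo", Client=lambda **kw: None)

# The import statement consults sys.modules first, so no file is needed.
import fake_sdk_for_demo

print(fake_sdk_for_demo.Client(api_key="unused"))
```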
+102
-6
@@ -29,7 +29,7 @@ from sentence_transformers import SentenceTransformer
 from watchdog.observers import Observer
 from watchdog.events import FileSystemEventHandler

-from encoding import extract_text, chunk_and_embed, write_embeddings_batch, SUPPORTED
+from encoding import extract_blocks, chunk_and_embed, write_embeddings_batch, SUPPORTED
 from failures import (
     record_ingest_failure as _record_failure_sql,
     resolve_ingest_failure as _resolve_failure_sql,
@@ -123,13 +123,61 @@ def resolve_ingest_failure(source: str):
         log.warning(f"Could not resolve ingest failure record (non-fatal): {e}")


+def delete_embeddings_for_path(filepath: Path):
+    """Remove embeddings rows for a file that no longer exists. Matches by
+    metadata.filepath so multi-folder same-basename files don't collide.
+    Legacy rows without filepath metadata are left alone — they get cleaned
+    by sweep_orphans.py."""
+    try:
+        pg = get_pg()
+        try:
+            cur = pg.cursor()
+            cur.execute(
+                "DELETE FROM embeddings WHERE metadata->>'filepath' = %s",
+                (str(filepath),),
+            )
+            deleted = cur.rowcount
+            pg.commit()
+            if deleted:
+                log.info(f"Deleted {deleted} chunks for removed file: {filepath}")
+        finally:
+            pg.close()
+    except Exception as e:
+        log.warning(f"Could not delete embeddings for {filepath} (non-fatal): {e}")
+
+
+def remove_from_state(filepath: Path):
+    """Drop a deleted file from watcher_state.json so it isn't carried as
+    'known mtime' indefinitely."""
+    try:
+        state = load_state()
+        key = str(filepath)
+        if key in state:
+            del state[key]
+            save_state(state)
+    except Exception as e:
+        log.warning(f"Could not update state for deleted {filepath} (non-fatal): {e}")
+
+
+IGNORED_TOP_FOLDERS = {"Drafts"}
+
+
 def ingest_file(filepath: Path, embedder) -> int:
     if filepath.name.startswith(("~$", "~", ".")):
         return 0
     if filepath.suffix.lower() not in SUPPORTED:
         return 0
-    text = extract_text(filepath)
-    if not text.strip():
+    try:
+        rel = filepath.parent.relative_to(NEXTCLOUD_PATH)
+        if rel.parts and rel.parts[0] in IGNORED_TOP_FOLDERS:
+            return 0
+    except ValueError:
+        pass
+    blocks = extract_blocks(filepath)
+    if not blocks or not any(
+        (b.get("text") or "").strip() or (b.get("heading") or "").strip()
+        for b in blocks
+    ):
         record_ingest_failure(filepath, "Text extraction failed or empty")
         return 0
     folder_rel = None
@@ -138,7 +186,7 @@ def ingest_file(filepath: Path, embedder) -> int:
     except ValueError:
         pass
     try:
-        rows = chunk_and_embed(text, filepath.name, embedder,
+        rows = chunk_and_embed(blocks, filepath.name, embedder,
                                filepath=filepath, folder=folder_rel)
     except Exception as e:
         log.error(f"Embedding failed for {filepath.name}: {e}")
@@ -159,7 +207,11 @@ def ingest_file(filepath: Path, embedder) -> int:
         return 0
     log.info(f"Indexed {len(rows)} chunks: {filepath.name}")
     resolve_ingest_failure(source)
-    enqueue_stage2(source, text)
+    full_text = "\n".join(
+        f"{b['heading']}\n{b['text']}" if b.get("heading") else b.get("text", "")
+        for b in blocks
+    )
+    enqueue_stage2(source, full_text)
     return len(rows)


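Review note: the emptiness check at the top of `ingest_file` tolerates `None` values via `(b.get("text") or "")`, while the join that feeds `enqueue_stage2` indexes `b['text']` directly, and `b.get("text", "")` still returns `None` when the key is present with a `None` value. Whether `extract_blocks` can emit such blocks is not visible in this diff. If it can, a tolerant flattening step might look like the sketch below; `flatten_blocks` is a hypothetical helper, and blocks are assumed to be dicts with optional `heading` and `text` keys.

```python
def flatten_blocks(blocks):
    """Join heading/text blocks into one string, treating missing or None fields as empty."""
    parts = []
    for block in blocks:
        heading = (block.get("heading") or "").strip()
        text = block.get("text") or ""
        # Put the heading on its own line above the text when there is one.
        parts.append(f"{heading}\n{text}" if heading else text)
    return "\n".join(parts)
```

For well-formed blocks with clean headings this produces the same string as the expression in the hunk above.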
@@ -207,6 +259,12 @@ def get_changed_files(state: dict) -> list:
             continue
         if "Computational Design 2017" in path.parts and "Student Work" in path.parts:
             continue
+        if path.name in ("Renders.pptx", "Ribbon Cutting Slideshow.pptx") \
+                and "Presentations" in path.parts:
+            continue
+        if path.name == "GH Slicer Notes [Autosaved].pptx" \
+                and "DDF555 3D Computational" in path.parts:
+            continue
         if path.stat().st_size == 0:
             continue
         if state.get(str(path)) != str(path.stat().st_mtime):
@@ -297,6 +355,12 @@ class IngestHandler(FileSystemEventHandler):
             return True
         if "Computational Design 2017" in path.parts and "Student Work" in path.parts:
             return True
+        if path.name in ("Renders.pptx", "Ribbon Cutting Slideshow.pptx") \
+                and "Presentations" in path.parts:
+            return True
+        if path.name == "GH Slicer Notes [Autosaved].pptx" \
+                and "DDF555 3D Computational" in path.parts:
+            return True
         return False

     def on_created(self, event):
@@ -322,15 +386,47 @@ class IngestHandler(FileSystemEventHandler):
     def on_moved(self, event):
         if event.is_directory:
             return
+        src = Path(event.src_path)
+        dest = Path(event.dest_path)
+        # If destination is outside NEXTCLOUD_PATH (e.g., Nextcloud trashbin at
+        # /home/aaron/nextcloud/data/data/aaron/files_trashbin/), treat as a
+        # delete — the file is no longer in the watched corpus.
+        try:
+            dest.relative_to(NEXTCLOUD_PATH)
+        except ValueError:
+            if src.suffix.lower() in SUPPORTED:
+                log.info(f"Event: moved out of tree {src} -> {dest}")
+                threading.Thread(
+                    target=lambda: (
+                        delete_embeddings_for_path(src),
+                        remove_from_state(src),
+                    ),
+                    daemon=True,
+                ).start()
+            return
         # Nextcloud WebDAV writes .part temp files then renames to final path.
         # src_path is the .part file; dest_path is the final filename.
-        dest = Path(event.dest_path)
         if dest.suffix.lower() not in SUPPORTED or self._should_ignore(dest):
             return
         log.info(f"Event: moved -> {dest}")
         self.pending = True
         self.last_event = time.time()

+    def on_deleted(self, event):
+        if event.is_directory:
+            return
+        path = Path(event.src_path)
+        if path.suffix.lower() not in SUPPORTED:
+            return
+        log.info(f"Event: deleted {path}")
+        threading.Thread(
+            target=lambda: (
+                delete_embeddings_for_path(path),
+                remove_from_state(path),
+            ),
+            daemon=True,
+        ).start()
+
     def on_closed(self, event):
         # FileClosedEvent fires on the final file after Nextcloud completes write.
         # Belt-and-suspenders catch for any write pattern not caught by on_moved.
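Review note: the new `on_moved` branch treats a move whose destination falls outside `NEXTCLOUD_PATH` as a delete, using `relative_to` and catching `ValueError`. The check can be tried on its own with pure paths, so nothing has to exist on disk. This is a sketch under assumptions: `is_inside` and `classify_move` are hypothetical names, and `WATCHED_ROOT` repeats the path from the scripts above.

```python
from pathlib import PurePosixPath

WATCHED_ROOT = PurePosixPath("/home/aaron/nextcloud/data/data/aaron/files")


def is_inside(path, root=WATCHED_ROOT):
    """True if path is root itself or lies below it (purely lexical check)."""
    try:
        PurePosixPath(path).relative_to(root)
        return True
    except ValueError:
        return False


def classify_move(dest, root=WATCHED_ROOT):
    """Label a move by its destination: still in the corpus, or effectively deleted."""
    return "moved" if is_inside(dest, root) else "deleted"
```

`relative_to` compares whole path components, so a sibling directory such as `files_trashbin` is not mistaken for something under `files`, which a plain string-prefix test would get wrong. The check is lexical only and does not resolve symlinks.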