Compare commits: 1101bef226..main (44 commits)
| Author | SHA1 | Date | |
|---|---|---|---|
| 5582549321 | |||
| 3ec9a48151 | |||
| 9d09d3fa14 | |||
| f185ed60cb | |||
| a4735053c2 | |||
| f682d8c6a0 | |||
| 151c756b89 | |||
| e96bf40b2f | |||
| 313c0f0341 | |||
| d2ec20e373 | |||
| 10bb29290a | |||
| 9bb083f065 | |||
| 430ea239dd | |||
| 0a1e2b4f61 | |||
| 8c2c597687 | |||
| fda61ad622 | |||
| 84994f9282 | |||
| 9e86297e2a | |||
| 9955c7e383 | |||
| 50b97e2998 | |||
| 8d560f9f5e | |||
| 732e450d21 | |||
| 63c58b5bb3 | |||
| 6c2af55e7e | |||
| 5b4a299414 | |||
| b09e35892c | |||
| e38d283e59 | |||
| 8e61e4dedb | |||
| 7b77794319 | |||
| d985f9e91e | |||
| b9eea6cb62 | |||
| 93c0d89308 | |||
| f18fb64fe5 | |||
| 72e07afc03 | |||
| c3011c80a5 | |||
| 4204806c80 | |||
| c5fc517fef | |||
| b35d44ef58 | |||
| a27f22ceaf | |||
| 7c7b649775 | |||
| 3c7c228db0 | |||
| 2df1a2fe01 | |||
| ed2d090afc | |||
| e5898f3019 |
@@ -8,6 +8,7 @@ dreamer_state.json
corpus_integrity_report.json
watcher_state.json
watcher_status.json
reindex_status.json

# Logs (these belong in /var/log/)
*.log
|
||||
|
||||
@@ -65,6 +65,38 @@ The watcher (`watcher.py` + `aaronai-watcher.service`) is a clean Stage 1 that m
|
||||
|
||||
|
||||
|
||||
---
|
||||
|
||||
## Updates — 2026-05-03 session
|
||||
|
||||
*Layered updates from Track 1 improvement work on 2026-05-03. The 2026-05-02 inventory above is preserved as a point-in-time snapshot; corrections and resolutions are recorded here with provenance.*
|
||||
|
||||
### Resolved
|
||||
|
||||
- **NREM-shape divergence #1 (cumulative cross-night exclusion 500-cap, `dream.py`): RESOLVED.** Replaced cumulative `retrieved_sources` with session-scoped novelty. Early REM now excludes only NREM high-scorers from the current session; Late REM excludes the current session's NREM ∪ Early REM. Legacy `retrieved_sources` key cleared from `dreamer_state.json`. Verification: post-fix dream-manifest source count rose to 24 (vs. 13 / 16 on the two prior comparable runs); the previously hidden ~40% of the corpus is now reachable by Early/Late REM, as the architecture and reframe specify. NREM exclusion fix from 2026-05-02 preserved.
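
The session-scoped exclusion described in this entry can be sketched roughly as follows. This is a minimal illustration, not the code in `dream.py`: the function names and the sample sources are hypothetical, and it assumes each stage yields a plain list of source names.

```python
def early_rem_exclusions(nrem_high_scorers):
    """Early REM excludes only this session's NREM high-scorers."""
    return set(nrem_high_scorers)


def late_rem_exclusions(nrem_high_scorers, early_rem_sources):
    """Late REM excludes the union of this session's NREM and Early REM sources."""
    return set(nrem_high_scorers) | set(early_rem_sources)


def filter_candidates(candidates, excluded):
    """Drop excluded sources while preserving retrieval order."""
    return [c for c in candidates if c not in excluded]


# Hypothetical session data; nothing is carried over from previous nights.
nrem = ["a.pdf", "b.docx"]
early = filter_candidates(["a.pdf", "c.pptx", "d.md"], early_rem_exclusions(nrem))
late = filter_candidates(["b.docx", "c.pptx", "e.txt"], late_rem_exclusions(nrem, early))
```

Because the exclusion sets are rebuilt from the current session only, no cumulative cap is needed and sources excluded on earlier nights become eligible again.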
|
||||
|
||||
### Corrections to existing findings
|
||||
|
||||
- **`stage2_metadata` location (Phase 1, `stage2_worker.py`):** the metadata column lives on `stage_3_queue.stage2_metadata` (jsonb), **not on `stage_2_queue`**. `stage_2_queue` has only basic queue fields (`id, source, full_text, char_length, timestamps, failure_reason, attempts`). The 2026-05-02 entry implied otherwise. Corrected via direct schema inspection on 2026-05-03.
|
||||
|
||||
- **Stage 2 char_length gate (Phase 1, `stage2_worker.py`):** the `char_length < 2000` check at line 139 runs *before* the Mistral call at line 149. For sub-2000-char docs, Mistral is **never invoked** — the worker logs `Processing → Skipping Stage 3 → completed_at = NOW()` with no Mistral pass between them. The earlier framing of "documents under 2000 chars skip Stage 3" was correct as written, but the implied "Stage 2 produces orientation metadata for everything" architecture commitment is not what the code does. 339 of 1,041 completed Stage 2 docs (33%) have **no frame data extracted at all**, not "frame data extracted then discarded."
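
The ordering matters more than the threshold itself. A minimal sketch of the control flow, assuming a hypothetical `process_doc` function and a stand-in callable for the Mistral call (neither name comes from `stage2_worker.py`):

```python
MIN_CHARS_FOR_STAGE3 = 2000  # threshold quoted in the finding above


def process_doc(doc, extract_frames):
    """Return (status, metadata). `extract_frames` stands in for the Mistral call."""
    if doc["char_length"] < MIN_CHARS_FOR_STAGE3:
        # Gate fires first: the doc is marked complete and the model is never called.
        return "completed_skipped_stage3", None
    return "queued_for_stage3", extract_frames(doc["full_text"])


calls = []


def fake_extract(text):
    """Record each invocation so the skip is observable."""
    calls.append(text)
    return {"frames": ["Example"]}


short = process_doc({"char_length": 120, "full_text": "x" * 120}, fake_extract)
long_ = process_doc({"char_length": 2500, "full_text": "y" * 2500}, fake_extract)
```

In this shape a short doc ends with no metadata at all, which matches the "no frame data extracted" reading rather than "extracted then discarded".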
|
||||
|
||||
### New findings from 2026-05-03 frame analysis (Improvement #3)
|
||||
|
||||
- **`ingest_conversations.py` bypasses Stage 2 entirely.** 198 distinct conversation sources (`Claude:`, `ChatGPT:`, `Aaron AI:`, plus `type='aaronai_conversation'`) write directly to pgvector `embeddings` and never enter `stage_2_queue`. Conversations have **zero frame coverage by design**, not by accident. Combined with the 339-doc char-gate exclusion and 12 Stage 2 failures, **only 56% of the embeddings corpus has any frame data**. Same NREM shape: a routing decision the architecture didn't explicitly request, silently contradicting the architecture's "Stage 2 produces orientation for everything" commitment.
|
||||
|
||||
- **Voice notes (14) and dream outputs (39) are systematically excluded from the frame system.** Within the 339-doc <2000-char gap: all 14 voice notes and all 39 dreamer-output files (NREM, Early REM, Late REM, synthesis markdown) are present. Voice is one of Aaron's primary capture channels. Dream outputs are the dreamer's own reflection. Both are silent to the frame system that orients downstream extraction — meaning the dreamer cannot frame-condition on its own output. Same NREM shape as the others.
|
||||
|
||||
- **File-type × frame stratification signal exists and is currently unused** (cross-link to Phase 3 `embeddings.type` finding). The 2026-05-03 frame analysis (`docs/stage2-frame-analysis-2026-05-03.md` §5) shows that within frame-extracted docs, "Programming" pivots to pptx (n=15), "Application" pivots to pdf (n=13), Education spreads across pdf+docx — file type adds discriminating signal to frame routing. Currently `embeddings.type` is NULL for 71% of rows; backfilling it (Improvement #2, not yet applied) would make this stratification queryable at retrieval time instead of reverse-engineerable from filenames.
|
||||
|
||||
### Artifacts produced 2026-05-03
|
||||
|
||||
- **Code change:** `scripts/dream.py` (Improvement #1).
- **New SQL view:** `stage2_frames_v` (over `stage_3_queue.stage2_metadata`; `CREATE OR REPLACE`, idempotent, drop with `DROP VIEW stage2_frames_v;`).
- **New analysis script:** `scripts/experiments/frame_distribution_report.py` (read-only).
- **JSON sidecar:** `experiments/frame_distribution_2026-05-03.json`.
- **Report:** `docs/stage2-frame-analysis-2026-05-03.md`.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 — Scripts
|
||||
|
||||
@@ -0,0 +1,105 @@
|
||||
# OCR install record — 2026-05-04
|
||||
|
||||
## Machine
|
||||
|
||||
- Host: aaronai-01 (VPS)
|
||||
- OS: Ubuntu 24.04 noble (kernel 6.8.0-110-generic, x86_64)
|
||||
|
||||
## apt packages installed
|
||||
|
||||
| package | version | source |
|---|---|---|
| tesseract-ocr | 5.3.4-1build5 | noble |
| tesseract-ocr-eng | 1:4.1.0-2 | noble |
| tesseract-ocr-osd | 1:4.1.0-2 | noble (automatic) |
| libtesseract5 | 5.3.4-1build5 | noble (automatic) |
|
||||
|
||||
## pip packages installed (into /home/aaron/aaronai/venv)
|
||||
|
||||
| package | version |
|---|---|
| pytesseract | 0.3.13 |
| ocrmypdf | 17.4.2 |
|
||||
|
||||
Dependencies pulled in by the two installs above (also new in venv): `pikepdf 10.5.1`, `pdfminer-six 20260107`, `pypdfium2 5.7.1`, `img2pdf 0.6.3`, `pi-heif 1.3.0`, `cryptography 47.0.0`, `cffi 2.0.0`, `pycparser 3.0`, `Deprecated 1.3.1`, `deprecation 2.1.0`, `defusedxml 0.7.1`, `fonttools 4.62.1`, `fpdf2 2.8.7`, `uharfbuzz 0.54.1`, `wrapt 2.1.2`, `pluggy 1.6.0`. `pillow` was already at 12.2.0.
|
||||
|
||||
## Smoke test 1 — `tesseract --version`
|
||||
|
||||
```
tesseract 5.3.4
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
Found AVX512BW
Found AVX512F
```
|
||||
|
||||
## Smoke test 2 — `tesseract --list-langs`
|
||||
|
||||
```
List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (2):
eng
osd
```
|
||||
|
||||
## Smoke test 3 — pytesseract on a slide image
|
||||
|
||||
- Input pptx: `/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF555 3D Computational/GH Slicer Notes.pptx`
- Extracted image: `ppt/media/image1.PNG` (1768×504 PNG)
- Wall-clock: 0.220s
- Chars extracted: 126
- First 200 chars (only 126 were extracted, so this is the full output):
|
||||
|
||||
```
Generates the Bounding Box for NESS

round(x, 4), round(y, 4), round(z, 4), round(a, 4))

Format ("HSS5 X(0} ¥(1} W(2} H(3)",
```
|
||||
|
||||
Note: the first image in `Renders.pptx` (image1.jpg, 640×480) returned 0 chars on the first attempt. A sample of 15 images from `Renders.pptx` showed that all 15 are pure rendered designs or photographs with no text, so the test switched to `GH Slicer Notes.pptx` (from the original candidate list of 4 image-only pptx files), where image1.PNG is a screenshot of code. Tesseract behavior is correct in both cases; `Renders.pptx` is not a useful OCR test target because it contains no text. The code screenshot shows some character-recognition noise (e.g. `¥(1}` for `Y(1)`, with parentheses and braces misread), which is acceptable for a baseline smoke test; production tuning is a worker-design concern.
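
A pptx file is a zip archive, so candidate images for OCR can be enumerated without opening the deck in an office tool. The sketch below uses only the standard library; the helper name `list_pptx_images` and the suffix list are assumptions for illustration, and the archive is built in memory so the example runs without a real deck.

```python
import io
import zipfile

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tif", ".tiff")


def list_pptx_images(pptx_file):
    """Return archive member names under ppt/media/ that look like images."""
    with zipfile.ZipFile(pptx_file) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.startswith("ppt/media/") and name.lower().endswith(IMAGE_SUFFIXES)
        )


# Build a tiny stand-in archive in memory so the sketch runs without a real deck.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("ppt/media/image1.PNG", b"not a real image")
    archive.writestr("ppt/slides/slide1.xml", "<slide/>")
buffer.seek(0)
images = list_pptx_images(buffer)
```

Each listed member could then be read from the archive, opened with Pillow, and passed to `pytesseract.image_to_string`; that step is left out here because it depends on the Tesseract install recorded above.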
|
||||
|
||||
## Smoke test 4 — ocrmypdf on a Lexmark CX510de scan
|
||||
|
||||
- Input PDF: `/home/aaron/nextcloud/data/data/aaron/files/Admin/Dossier/Tenure/Dossier Scan 2022/image2022-01-07-133846 - CAryn.pdf` (4 pages, Producer: Lexmark CX510de, Creator: HardCopy)
- Command: `ocrmypdf --skip-text -l eng <input> /tmp/ocr_smoke/caryn_ocred.pdf`
- Wall-clock: 3.72s (whole PDF, 4 pages)
- Exit: 0
- After OCR, `pdftotext` on the output produced 2347 chars (2270 non-whitespace).
- First 200 chars of OCR'd text:
|
||||
|
||||
```
nN New Paltz
STATE UNIVERSITY OF NEW YORK

The Honors Program

May 30, 2017

Dear Aaron,

Thank you for serving as a reader for Caryn Byllott’s thesis on "Recall/Reconstruct: The Exploration of
Memory
```
|
||||
|
||||
The output is real, readable English. The "nN" header is the Lexmark logo glyph; otherwise the text is clean. Throughput is ~0.93s/page on this scan, which is the reference number for sizing the async worker queue.
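
As a rough aid for that sizing, the per-page figure can be turned into a back-of-envelope estimate. This is a sketch under stated assumptions: pages are split evenly across workers, the single measured scan is representative, and the 300-page backlog is a hypothetical number (the real page count is not recorded in this document).

```python
import math

SECONDS_PER_PAGE = 3.72 / 4  # measured above: 4 pages in 3.72 s


def estimate_ocr_seconds(total_pages, workers=1, seconds_per_page=SECONDS_PER_PAGE):
    """Rough wall-clock estimate assuming pages split evenly across workers."""
    pages_per_worker = math.ceil(total_pages / workers)
    return pages_per_worker * seconds_per_page


# Hypothetical backlog of 300 pages; substitute the real count once known.
single = estimate_ocr_seconds(300)
double = estimate_ocr_seconds(300, workers=2)
```

One timing on one scanner model is a thin basis, so the estimate is best treated as an order-of-magnitude check rather than a capacity plan.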
|
||||
|
||||
## Reference timing
|
||||
|
||||
| operation | input size | wall-clock |
|---|---|---|
| pytesseract single image | 1768×504 PNG | 0.22s |
| ocrmypdf 4-page scan | 4 pages, ~A4 | 3.72s (~0.93s/page) |
|
||||
|
||||
## Deferred — project dep-tracking
|
||||
|
||||
The project has no dependency manifest on disk: no `requirements.txt`, `pyproject.toml`, `setup.py`, `Pipfile`, or `poetry.lock`. Pip deps live only in `venv/`. The OCR install adds `pytesseract` and `ocrmypdf` (plus their transitive closure listed above) to that untracked venv state.
|
||||
|
||||
This commit does not introduce a manifest. The dep-manifest decision is tracked as its own followup; the natural deadline is the capture-path integration commit, where `import pytesseract` becomes load-bearing in the repo. If the manifest question is still unresolved by then, that integration commit is the right place to address it.
|
||||
|
||||
## Followups
|
||||
|
||||
- Async OCR worker (separate session). Use the reference timing above to size the queue.
- Capture path integration: phone-camera images → `pytesseract.image_to_string` → existing chunk/embed pipeline.
- Backlog processing of 75 scanned PDFs (Lexmark CX510de and similar) and the 4 image-only pptx (`Renders.pptx`, `Ribbon Cutting Slideshow.pptx`, two `GH Slicer Notes` variants). Per the smoke results, `Renders.pptx` is unlikely to yield useful OCR text (it is rendered-design content, not scanned documents) and may need exclusion rather than processing.
- Project dep-manifest decision (see Deferred section above).
|
||||
@@ -0,0 +1,175 @@
|
||||
# Stage 2 Frame Analysis — 2026-05-03
|
||||
|
||||
*Improvement #3 of three Track 1 improvements. Read-only report on the frame data Stage 2 produces, in service of Track 2 substrate design (Step 2.4 operation set spec).*
|
||||
|
||||
**Data source:** `stage_3_queue.stage2_metadata` (jsonb), exposed via the new SQL view `stage2_frames_v`. Analysis script: `scripts/experiments/frame_distribution_report.py`. Sidecar JSON: `experiments/frame_distribution_2026-05-03.json`. **Stage 3 service is currently stopped, so this is a stable snapshot.**
|
||||
|
||||
---
|
||||
|
||||
## Verdict
|
||||
|
||||
**Frames cluster meaningfully but coverage is partial.** Frame distribution is skewed (one frame, "Education", appears in 36% of frame-extracted docs) but not degenerate — the top 20 frames carry recognizable domain signal, file-type bins differentiate them further, and per-doc frame counts are healthy. **However, only 56% of the embeddings corpus has any frame data at all.** The other 44% — conversations, short files, voice notes, dream outputs — has zero frame coverage by design, not by accident.
|
||||
|
||||
Frame-conditional routing is a viable γ component candidate **for the document side of the corpus**. It is not a viable router for the conversational or self-generated side without filling the coverage hole.
|
||||
|
||||
---
|
||||
|
||||
## 1. Corpus-wide frame coverage
|
||||
|
||||
| Class | Count | % of corpus | Frame coverage |
|---|---|---|---|
| Total distinct sources in `embeddings` | 1,255 | 100% | — |
| Files with frames (`stage_3_queue.stage2_metadata`) | 704 | 56.1% | yes |
| Conversations (Claude / ChatGPT / Aaron AI) | 198 | 15.8% | **none — bypass Stage 2 by design** |
| Files <2,000 chars (Stage 2 char-gate skip) | 339 | 27.0% | **none — Mistral never invoked** |
| Files that failed Stage 2 | 12 | 1.0% | none |
|
||||
|
||||
**56.1% frame coverage** is the headline. The architectural reason for the gap is twofold:
|
||||
|
||||
1. **`ingest_conversations.py` writes directly to `embeddings`** with `type='aaronai_conversation'` and never enqueues to `stage_2_queue`. Conversations have never been frame-extracted, full stop.
|
||||
2. **`stage2_worker.py:139` gates Mistral on char_length.** Docs <2,000 chars are marked complete with `completed_at = NOW()` *before* Mistral runs. The Mistral cost is not paid for these (correction to my earlier framing in the inventory) — but neither is any frame data produced.
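
The percentages in the coverage table can be rechecked directly from the counts. The snippet below is only an arithmetic check; the dictionary keys are labels chosen for this example. Note that the four classes sum to 1,253, two fewer than the 1,255 total, so the sketch surfaces that difference rather than explaining it.

```python
TOTAL_SOURCES = 1255  # distinct sources in `embeddings`, from the table above

counts = {
    "framed": 704,
    "conversations": 198,
    "short_docs": 339,
    "stage2_failures": 12,
}

# Share of the corpus per class, rounded to one decimal place as in the table.
percentages = {name: round(100 * n / TOTAL_SOURCES, 1) for name, n in counts.items()}

# Sources not assigned to any of the four classes.
unaccounted = TOTAL_SOURCES - sum(counts.values())
```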
|
||||
|
||||
## 2. Frame distribution (the docs that DO have frames)
|
||||
|
||||
**668 docs, 1,374 distinct frame labels. Top-20 by count:**
|
||||
|
||||
| Frame | Count | % of frame-extracted docs |
|---|---|---|
| Education | 238 | 35.6% |
| Course | 58 | 8.7% |
| Programming | 43 | 6.4% |
| Design | 32 | 4.8% |
| Professional Experience | 24 | 3.6% |
| Employment | 24 | 3.6% |
| Research | 23 | 3.4% |
| 3D Printing | 22 | 3.3% |
| Project, Grading, Art, Budget | 21 each | 3.1% |
| Academic Integrity | 20 | 3.0% |
| Teaching, Technology, Attendance, Application | 13–19 | — |
| Accommodation, Manufacturing, Coursework, Recommendation | 10–13 | — |
|
||||
|
||||
**Per-doc frame count:** median 3–4 frames per doc; 76% of docs have 3–5 frames; one outlier doc has 30 frames (Mistral over-segmented).
|
||||
|
||||
**Long tail is enormous.** 1,374 distinct labels for 668 docs means most labels appear once. Mistral is producing a near-open vocabulary, not a clean taxonomy.
|
||||
|
||||
**"Education" is the universal frame.** It dominates co-occurrence pairs (8 of the top-10 pairs include Education). Education functions as a near-tautology for this corpus and carries less discriminating signal than narrower frames like "Programming" or "3D Printing."
|
||||
|
||||
## 3. Label hygiene
|
||||
|
||||
**54 normalized collisions** detected (case-insensitive, underscore-vs-space):
|
||||
|
||||
| Concept | Variant counts |
|---|---|
| Professional Experience | `Professional Experience`:24 + `Professional_Experience`:6 |
| 3D Printing | `3D Printing`:22 + `3D_Printing`:7 |
| Academic Integrity | `Academic Integrity`:20 + `Academic_Integrity`:2 |
| Course Design | `Course Design`:9 + `Course_Design`:1 |
| Project Management | `Project Management`:7 + `Project_Management`:1 |
| Computational Design | `Computational Design`:7 + `Computational_Design`:1 |
| (… 48 more) | |
|
||||
|
||||
Without normalization, ~30+ documents have their frames silently split across spelling variants for the same concept. Any frame-conditional router must normalize before counting. Recommended canonical form: lowercase, single-space, hyphens preserved.
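
A minimal sketch of the recommended canonical form (lowercase, single-space, hyphens preserved), with underscores treated as spaces so the collision variants merge. The function names are hypothetical and this is not the normalizer used by the analysis script; the sample counts are taken from the collision table.

```python
import re
from collections import defaultdict


def normalize_label(label):
    """Lowercase, treat underscores as spaces, collapse whitespace; keep hyphens."""
    spaced = label.replace("_", " ")
    return re.sub(r"\s+", " ", spaced).strip().lower()


def merge_counts(raw_counts):
    """Sum counts of raw labels that share a canonical form."""
    merged = defaultdict(int)
    for label, count in raw_counts.items():
        merged[normalize_label(label)] += count
    return dict(merged)


# Variant counts taken from the collision table above.
raw = {
    "Professional Experience": 24,
    "Professional_Experience": 6,
    "3D Printing": 22,
    "3D_Printing": 7,
}
merged = merge_counts(raw)
```

Keeping the raw label alongside the canonical one, rather than overwriting it, would leave room to change the normalization rule later without rerunning Mistral.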
|
||||
|
||||
## 4. Worker version drift
|
||||
|
||||
| Worker version | Doc count | Notes |
|---|---|---|
| v2.1 | 665 | Two ad-hoc-key intrusions: `academic_details` (1 doc), `additional_information` (1 doc). Mistral occasionally invents extra structured keys not in the prompt schema. |
| v2.0 | 3 | Same key shape as v2.1 baseline. |
|
||||
|
||||
Schema is stable across the version transition for this dataset. The ad-hoc keys are a Mistral quirk (instruction-following variance), not a worker bug. **For Track 2 substrate ingest, plan for `stage2_metadata` to occasionally include unexpected top-level keys.**
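
One way to plan for those keys is to separate known fields from extras at ingest instead of rejecting the row. In the sketch below, `EXPECTED_KEYS` is a placeholder: the real prompt schema is not shown in this report, so both the key names and the sample row are assumptions.

```python
EXPECTED_KEYS = {"frames", "worker_version"}  # hypothetical; the real schema is not shown here


def split_metadata(metadata):
    """Separate known keys from any extra keys the model invented."""
    known = {k: v for k, v in metadata.items() if k in EXPECTED_KEYS}
    extras = {k: v for k, v in metadata.items() if k not in EXPECTED_KEYS}
    return known, extras


# Sample row shaped like the `academic_details` intrusion noted in the table.
row = {
    "frames": ["Education"],
    "worker_version": "v2.1",
    "academic_details": {"term": "fall"},
}
known, extras = split_metadata(row)
```

Logging the extras rather than dropping them silently would keep this from becoming another quiet routing decision.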
|
||||
|
||||
## 5. File-type signal
|
||||
|
||||
This is the most useful Track 2 finding from this report.
|
||||
|
||||
`stage_3_queue.source` stores bare filenames, so I bin by file-type suffix. Frames stratify cleanly:
|
||||
|
||||
| Frame | pdf | docx | pptx | markdown | txt | dream |
|---|---|---|---|---|---|---|
| Education | 116 | 119 | 3 | — | — | — |
| Course | 29 | 29 | — | — | — | — |
| Programming | 12 | 10 | **15** | — | 6 | — |
| Application | **13** | 2 | — | — | — | — |
| 3D Printing | 11 | 3 | **8** | — | — | — |
| Manufacturing | 3 | 6 | 4 | — | — | — |
| Research | 9 | 13 | — | 1 | — | — |
|
||||
|
||||
**Concrete signal:** "Programming" pivots toward pptx (slide decks), "Application" pivots toward pdf (compiled PDFs), Education spreads across pdf+docx (syllabi and dossiers). File type is essentially free signal — the watcher already knows it — and it disambiguates frames that the model treats as equivalent. **`embeddings.type` is currently NULL for 71% of rows per inventory finding 5; backfilling that field (Improvement #2) makes file-type signal actually queryable instead of reverse-engineerable from filenames.**
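
The suffix binning behind the cross-tab can be sketched in a few lines. This is an illustration rather than the logic in `frame_distribution_report.py`: the function names and sample rows are hypothetical, and it bins by raw suffix only, without the report's `markdown` and `dream` categories.

```python
from collections import Counter
from pathlib import PurePosixPath


def file_type(source):
    """Bin a bare filename by its lowercase suffix, e.g. 'Deck.PPTX' -> 'pptx'."""
    suffix = PurePosixPath(source).suffix.lower().lstrip(".")
    return suffix or "unknown"


def cross_tab(rows):
    """Count (frame, file_type) pairs from (source, frames) rows."""
    table = Counter()
    for source, frames in rows:
        for frame in frames:
            table[(frame, file_type(source))] += 1
    return table


# Hypothetical rows; real input would come from the `stage2_frames_v` view.
rows = [
    ("GH Slicer Notes.pptx", ["Programming", "3D Printing"]),
    ("Syllabus.pdf", ["Education", "Course"]),
    ("Intro.PPTX", ["Programming"]),
]
table = cross_tab(rows)
```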
|
||||
|
||||
## 6. Systematic exclusions inside the 339-doc gap
|
||||
|
||||
Of the 339 short docs that bypass frame extraction, the breakdown by file type:
|
||||
|
||||
| Type | Count | What this is |
|---|---|---|
| pdf | 110 | Short PDFs (forms, single-page docs) |
| docx | 110 | Short Word docs |
| dream_output | 39 | **The dreamer's own NREM/Early-REM/Late-REM/synthesis files** |
| pptx | 31 | Short slide decks |
| txt | 28 | Plain-text files |
| voice_note | 14 | **Every voice note in the corpus** |
| markdown | 7 | Short markdown |
|
||||
|
||||
**Two specific systematic exclusions worth naming separately:**
|
||||
|
||||
- **All 14 voice notes have no frames.** Voice is one of Aaron's primary capture channels. The frame system is silent on it.
- **All 39 dream outputs have no frames.** The dreamer's writing is invisible to the frame system that orients the dreamer's own next pass. The system cannot frame-condition on its own output.
|
||||
|
||||
These are NREM-shape findings: the architecture's frame extraction is *quietly* not running on whole categories of input that the architecture treats as first-class. Recommended for the inventory.
|
||||
|
||||
---
|
||||
|
||||
## 7. Would frame-conditional routing be a viable γ component, and what would it condition on?
|
||||
|
||||
**Viable on the framed-doc subset, subject to validation on larger samples for §5 stratification.** The 56% of corpus with frames shows real distributional signal; the 44% gap is unrouted. Conditions for the framed-doc subset:
|
||||
|
||||
1. **Normalize labels before any routing decision.** 54 collision groups today; the router must operate on normalized canonical form, not raw Mistral output. Add a normalization layer between Mistral and any consumer.
|
||||
2. **Treat "Education" as a near-universal prior, not a frame.** It carries low routing signal because it's everywhere. Either drop it from the conditional, or use it as the *base case* and condition on the secondary frame. (See §8 follow-up — the dominance may be a Mistral prompt artifact rather than a corpus shape; cheap diagnostic available.)
|
||||
3. **Combine frames with file type, not frames alone.** Frame × file-type stratifies more cleanly than frame alone (see §5). The §5 cross-tab is suggestive — Programming → pptx (n=15), Application → pdf (n=13) — but cell counts are small and need validation on a larger sample before being load-bearing for substrate design.
|
||||
|
||||
**What it would condition on:** the joint of (normalized frame set, file type, doc length bucket). Concretely, a Track 2 router could compute `P(this doc is relevant to current goal | frames ∩ goal_frames, file_type, length)` rather than using a fixed cosine similarity threshold. Frames give the topic axis; file type gives the genre axis; length gives the granularity axis.
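
To make the three axes concrete, the sketch below builds the conditioning features for one doc. It is not a router: Jaccard overlap is a simple stand-in for the frame term, the length cut points are invented for illustration, and all names are hypothetical. Estimating the actual probability would need labeled relevance data that this report does not have.

```python
def length_bucket(char_length):
    """Coarse granularity axis; the cut points are illustrative assumptions."""
    if char_length < 2000:
        return "short"
    if char_length < 20000:
        return "medium"
    return "long"


def routing_features(doc_frames, goal_frames, file_type, char_length):
    """Build the joint (frame overlap, file type, length bucket) for one doc."""
    doc_set, goal_set = set(doc_frames), set(goal_frames)
    union = doc_set | goal_set
    overlap = len(doc_set & goal_set) / len(union) if union else 0.0
    return {
        "frame_overlap": overlap,  # Jaccard similarity on normalized labels
        "file_type": file_type,
        "length_bucket": length_bucket(char_length),
    }


features = routing_features(
    doc_frames={"programming", "3d printing"},
    goal_frames={"programming", "design"},
    file_type="pptx",
    char_length=5400,
)
```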
|
||||
|
||||
**Defined scope (the coverage caveat):**
|
||||
|
||||
The router only works on the 56% of corpus that has frames. To extend to the full corpus, Track 2 has three options:
|
||||
|
||||
- **(a) Backfill frames for short docs and conversations.** Run Mistral on the 339 short docs (cheap, since they're short) and on the 198 conversations. This makes frames a corpus-wide signal at the cost of a one-time Mistral run.
- **(b) Use a degraded fallback for unframed docs.** File-type signal is available for short files; conversation type is available for conversations. Route those by their available signal; route framed docs by frame+type.
- **(c) Accept the gap as a scope limit.** The router only operates on long, non-conversation files. The 44% gap stays unrouted and falls through to whatever the current default is.
|
||||
|
||||
(a) is the most general and the most aligned with the architecture's stated commitment ("Stage 2 produces orientation metadata for everything"). Mistral cost on the 537 affected sources (339 short docs plus 198 conversations) is small. **Recommend (a) before any router work begins.**
|
||||
|
||||
---
|
||||
|
||||
## 8. Recommended follow-ups (ordered by ROI)
|
||||
|
||||
1. **Backfill the 339 short docs.** Run a one-shot script that bypasses the char_length gate and runs Mistral on them. The voice notes and dream outputs are the highest priorities — primary capture and primary self-reflection channels currently silent.
|
||||
2. **Backfill conversations into frame extraction.** Either modify `ingest_conversations.py` to enqueue Stage 2, or run a one-shot conversation-frame extraction pass. This is the larger backfill (198 conversations, multiple chunks each) but it removes the conversational coverage hole.
|
||||
3. **Add a frame-label normalizer at the worker.** New rows write a normalized canonical form alongside the raw Mistral output. Older rows can be normalized at query time via the view.
|
||||
4. **Decide whether to deprecate "Education" as a frame.** It's so universal in this corpus that it adds noise. Either drop it from Mistral's prompt, or downweight it in any router that conditions on frames.
|
||||
5. **Per-frame retrieval-similarity follow-up (deferred from this report).** Now that we know frames cluster meaningfully, instrumenting `dream.py` to record per-source similarity per stage becomes worthwhile. That tells us whether retrieval implicitly prefers certain frames already.
|
||||
|
||||
6. **Diagnose the "Education" dominance: prompt artifact vs. corpus shape.** Education appears in 36% of frame-extracted docs. Two hypotheses: (a) Mistral's prompt biases toward institutional/academic framings (prompt artifact); (b) the corpus genuinely is dominated by academic/teaching content (corpus shape). Cheap diagnostic: hand-inspect 20 random docs tagged "Education", classify as *truly academic content* vs. *Education was a default Mistral reached for*. If the split is mostly (b), Education is honest signal and the router should treat it as a base case; if mostly (a), revise the Mistral prompt to discourage default tags. 20-doc sample is small enough to do in one sitting, large enough to distinguish the hypotheses at >70/30 splits.
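
The claim that 20 docs can separate the hypotheses at a 70/30 split can be checked with a binomial tail. The sketch assumes independent random draws and asks how often the sample majority matches the true majority class; the function name is hypothetical.

```python
from math import comb


def prob_majority(n, p):
    """P(more than half of n sampled docs fall in a class with true share p)."""
    threshold = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(threshold, n + 1))


# With a true 70/30 split, how often does a 20-doc sample show the majority class?
p_correct = prob_majority(20, 0.7)
```

Under these assumptions the sample majority points the right way in well over nine cases out of ten, which weakens quickly as the true split approaches 50/50.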
|
||||
|
||||
---
|
||||
|
||||
## 9. Inventory edits flagged for session-end batch
|
||||
|
||||
- **Correction:** `stage2_metadata` lives on `stage_3_queue.stage2_metadata` (jsonb), not on `stage_2_queue` as the inventory implied. The Phase 1 / `stage2_worker.py` entry should be corrected.
|
||||
- **New finding:** the char_length gate runs *before* the Mistral call (`stage2_worker.py:139` precedes `:147`). For the 339 sub-2000-char docs, Mistral is never invoked. Reframes the architecture's "Stage 2 extracts orientation for everything" commitment.
|
||||
- **New finding:** `ingest_conversations.py` bypasses Stage 2 entirely. 198 conversation sources have zero frame coverage by design. Same NREM shape as #1 — a routing decision the architecture didn't explicitly request.
|
||||
- **New finding (cross-link to #2):** `embeddings.type` NULL-rate findings now have a concrete read consumer. File-type signal would unlock the frame × file-type stratification described in §5.
|
||||
- **New finding:** Within the 339-doc data gap, two systematic categorical exclusions are worth naming separately: **all 14 voice notes** and **all 39 dream outputs** are in the gap. Voice is one of Aaron's primary capture channels; dream outputs are the dreamer's own self-generated reflection. Both are silent to the frame system that orients downstream extraction — which means the dreamer cannot frame-condition on its own output. Same NREM shape as the others — a routing decision the architecture didn't explicitly request.
|
||||
|
||||
## 10. Reproduction
|
||||
|
||||
```bash
cd ~/aaronai
venv/bin/python3 scripts/experiments/frame_distribution_report.py
# stdout: human-readable report
# json: experiments/frame_distribution_<date>.json
# view: stage2_frames_v (in pgvector DB)
```
|
||||
|
||||
The view is `CREATE OR REPLACE`, idempotent. Drop with `DROP VIEW stage2_frames_v;` if needed.
|
||||
@@ -0,0 +1,857 @@
|
||||
{
|
||||
"generated_at": "2026-05-03T23:47:54.802182+00:00",
|
||||
"section_1": {
|
||||
"overall": {
|
||||
"total": 14069,
|
||||
"type_null": 9815,
|
||||
"ca_null": 12109,
|
||||
"both_null": 9815,
|
||||
"both_set": 1960
|
||||
},
|
||||
"cohorts": [
|
||||
{
|
||||
"type": "aaronai_conversation",
|
||||
"ca_null": false,
|
||||
"n": 71
|
||||
},
|
||||
{
|
||||
"type": "chatgpt_conversation",
|
||||
"ca_null": true,
|
||||
"n": 1548
|
||||
},
|
||||
{
|
||||
"type": "claude_conversation",
|
||||
"ca_null": false,
|
||||
"n": 1074
|
||||
},
|
||||
{
|
||||
"type": "claude_memory",
|
||||
"ca_null": true,
|
||||
"n": 1
|
||||
},
|
||||
{
|
||||
"type": "document",
|
||||
"ca_null": false,
|
||||
"n": 815
|
||||
},
|
||||
{
|
||||
"type": "document",
|
||||
"ca_null": true,
|
||||
"n": 745
|
||||
},
|
||||
{
|
||||
"type": null,
|
||||
"ca_null": true,
|
||||
"n": 9815
|
||||
}
|
||||
]
|
||||
},
|
||||
"section_2": {
|
||||
"by_ext": [
|
||||
{
|
||||
"ext": ".pdf",
|
||||
"rows": 6886
|
||||
},
|
||||
{
|
||||
"ext": ".txt",
|
||||
"rows": 1501
|
||||
},
|
||||
{
|
||||
"ext": ".docx",
|
||||
"rows": 1048
|
||||
},
|
||||
{
|
||||
"ext": ".pptx",
|
||||
"rows": 353
|
||||
},
|
||||
{
|
||||
"ext": ".md",
|
||||
"rows": 27
|
||||
}
|
||||
],
|
||||
"classified": 9815,
|
||||
"unclassifiable": 0
|
||||
},
|
||||
"section_3": {
|
||||
"watcher_state_paths": 1462,
|
||||
"watcher_state_basenames": 1183,
|
||||
"watcher_state_collisions": 109,
|
||||
"rows_with_filepath": {
|
||||
"total": 9816,
|
||||
"exists": 9649,
|
||||
"missing": 167,
|
||||
"outside_root": 0,
|
||||
"sample": [
|
||||
{
|
||||
"id": "f317f238_0",
|
||||
"source": "NO thesis proposal.docx",
|
||||
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF790 Thesis/Nic OConnor/NO thesis proposal.docx",
|
||||
"mtime": "2024-01-26T15:06:09Z"
|
||||
},
|
||||
{
|
||||
"id": "81047646_0",
|
||||
"source": "Metals II Syllabus.pdf",
|
||||
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Professional/Job Applications/Job Apps Fall 2015/App State/Metals II Syllabus.pdf",
|
||||
"mtime": "2012-02-26T22:45:15Z"
|
||||
},
|
||||
{
|
||||
"id": "81047646_1",
|
||||
"source": "Metals II Syllabus.pdf",
|
||||
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Professional/Job Applications/Job Apps Fall 2015/App State/Metals II Syllabus.pdf",
|
||||
"mtime": "2012-02-26T22:45:15Z"
|
||||
},
|
||||
{
|
||||
"id": "4e49d3b4_4",
|
||||
"source": "Circuit Intro.pdf",
|
||||
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF310 Mechatronics/Week 1/Circuit Intro.pdf",
|
||||
"mtime": "2022-01-31T23:28:56Z"
|
||||
},
|
||||
{
|
||||
"id": "81047646_2",
|
||||
"source": "Metals II Syllabus.pdf",
|
||||
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Professional/Job Applications/Job Apps Fall 2015/App State/Metals II Syllabus.pdf",
|
||||
"mtime": "2012-02-26T22:45:15Z"
|
||||
}
|
||||
]
|
||||
},
|
||||
"rows_without_filepath": {
|
||||
"total": 744,
|
||||
"distinct_basenames": 228,
|
||||
"unique_hit": 211,
|
||||
"collision_hit": 16,
|
||||
"unfound": 1
|
||||
},
|
||||
"collision_shapes": {
|
||||
"total": 109,
|
||||
"shape_counts": {
|
||||
"multi-live": 95,
|
||||
"live+archive": 14
|
||||
},
|
||||
"rows_affected_by_shape": {
|
||||
"multi-live": 85,
|
||||
"live+archive": 0
|
||||
},
|
||||
"samples": {
|
||||
"multi-live": [
|
||||
{
|
||||
"name": "README.md",
|
||||
"rows_no_fp_using_this_name": 0,
|
||||
"candidates": [
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/README.md",
|
||||
"mtime": "2026-04-25T17:08:01Z"
|
||||
},
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Processing/Nature of Code/The-Nature-of-Code-Examples/The-Nature-of-Code-Examples-master/README.md",
|
||||
"mtime": "2017-03-09T23:32:59Z"
|
||||
},
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/samples/hal/README.md",
|
||||
"mtime": "2016-12-21T10:37:05Z"
|
||||
},
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/platforms/maven/README.md",
|
||||
"mtime": "2016-12-21T10:37:05Z"
|
||||
},
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/README.md",
|
||||
"mtime": "2016-12-21T10:37:03Z"
|
||||
},
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/3rdparty/openvx/README.md",
|
||||
"mtime": "2016-12-21T10:37:03Z"
|
||||
},
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/3rdparty/openvx/hal/README.md",
|
||||
"mtime": "2016-12-21T10:37:03Z"
|
||||
},
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/3rdparty/carotene/README.md",
|
||||
"mtime": "2016-12-21T10:37:02Z"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "3DPrinting_v2.pptx",
|
||||
"rows_no_fp_using_this_name": 4,
|
||||
"candidates": [
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Presentations/Invited/Innovation Center/3DPrinting_v2.pptx",
|
||||
"mtime": "2026-04-24T19:34:49Z"
|
||||
},
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Presentations/Invited/Cuba/Assets/3DPrinting_v2.pptx",
|
||||
"mtime": "2026-04-24T19:34:18Z"
|
||||
},
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Presentations/Conference/3D Printing/3DPrinting_v2.pptx",
|
||||
"mtime": "2026-04-24T19:34:15Z"
|
||||
},
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Workshops/3DPrinting_v2.pptx",
|
||||
"mtime": "2026-04-24T19:30:14Z"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "Print in Place.docx",
|
||||
"rows_no_fp_using_this_name": 0,
|
||||
"candidates": [
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF205 CAD1/Print in Place.docx",
|
||||
"mtime": "2017-08-24T03:50:36Z"
|
||||
},
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Academic/ARS393 CVS1/Print in Place.docx",
|
||||
"mtime": "2015-10-28T20:36:52Z"
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"live+archive": [
|
||||
{
|
||||
"name": "dreamer-design-spec.md",
|
||||
"rows_no_fp_using_this_name": 0,
|
||||
"candidates": [
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Journal/dreamer-design-spec.md",
|
||||
"mtime": "2026-04-25T22:55:11Z"
|
||||
},
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Archive/dreamer-design-spec.md",
|
||||
"mtime": "2026-04-25T22:55:11Z"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "BirdAI-Ingest-Architecture.md",
|
||||
"rows_no_fp_using_this_name": 0,
|
||||
"candidates": [
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Journal/BirdAI-Ingest-Architecture.md",
|
||||
"mtime": "2026-04-28T00:08:38Z"
|
||||
},
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Archive/BirdAI-Ingest-Architecture.md",
|
||||
"mtime": "2026-04-28T00:08:38Z"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "graphiti-migration-plan.md",
|
||||
"rows_no_fp_using_this_name": 0,
|
||||
"candidates": [
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Journal/graphiti-migration-plan.md",
|
||||
"mtime": "2026-04-27T17:54:40Z"
|
||||
},
|
||||
{
|
||||
"path": "/home/aaron/nextcloud/data/data/aaron/files/Archive/Migration Plans/graphiti-migration-plan.md",
|
||||
"mtime": "2026-04-27T17:54:40Z"
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
},
|
||||
"section_4": {
|
||||
"export_dir_exists": true,
|
||||
"files": [
|
||||
{
|
||||
"name": "conversations-000.json",
|
||||
"size": 19050556,
|
||||
"mtime": "2026-04-24T19:55:44Z"
|
||||
},
|
||||
{
|
||||
"name": "conversations-001.json",
|
||||
"size": 29057594,
|
||||
"mtime": "2026-04-24T19:55:44Z"
|
||||
}
|
||||
],
|
||||
"convo_index_size": 169,
|
||||
"sample_results": [
|
||||
{
|
||||
"id": "chatgpt_87cc0c47-aaf9-42da-8169-3b8922f3afba_0",
|
||||
"source": "ChatGPT: Dog named Bird",
|
||||
"convo_id": "87cc0c47-aaf9-42da-8169-3b8922f3afba",
|
||||
"create_time": 1708835138.51948,
|
||||
"create_time_iso": "2024-02-25T04:25:38.519480Z",
|
||||
"resolved": true
|
||||
},
|
||||
{
|
||||
"id": "chatgpt_689fab3e-d79c-8333-aeb5-7da4e9ca160d_0",
|
||||
"source": "ChatGPT: Video understanding limitations",
|
||||
"convo_id": "689fab3e-d79c-8333-aeb5-7da4e9ca160d",
|
||||
"create_time": 1755294541.894811,
|
||||
"create_time_iso": "2025-08-15T21:49:01.894811Z",
|
||||
"resolved": true
|
||||
},
|
||||
{
|
||||
"id": "chatgpt_611ff391-7fc0-42ea-bfd9-18dbe1739f19_7",
|
||||
"source": "ChatGPT: Calculating Truncated Cone Angle",
|
||||
"convo_id": "611ff391-7fc0-42ea-bfd9-18dbe1739f19",
|
||||
"create_time": 1724020869.471264,
|
||||
"create_time_iso": "2024-08-18T22:41:09.471264Z",
|
||||
"resolved": true
|
||||
},
|
||||
{
|
||||
"id": "chatgpt_68ce1921-084c-8330-877c-78df1e03e54c_50",
|
||||
"source": "ChatGPT: Soul music playlist ideas",
|
||||
"convo_id": "68ce1921-084c-8330-877c-78df1e03e54c",
|
||||
"create_time": 1758337313.438344,
|
||||
"create_time_iso": "2025-09-20T03:01:53.438344Z",
|
||||
"resolved": true
|
||||
},
|
||||
{
|
||||
"id": "chatgpt_c02e94f0-17db-4fd9-be04-13aaa1b728cb_1",
|
||||
"source": "ChatGPT: Create Rhino plugin in Python",
|
||||
"convo_id": "c02e94f0-17db-4fd9-be04-13aaa1b728cb",
|
||||
"create_time": 1682716259.557353,
|
||||
"create_time_iso": "2023-04-28T21:10:59.557353Z",
|
||||
"resolved": true
|
||||
}
|
||||
],
|
||||
"sample_resolved": 5,
|
||||
"full_cohort": {
|
||||
"distinct_convo_ids": 168,
|
||||
"resolvable_from_export": 168,
|
||||
"unresolvable": 0
|
||||
}
|
||||
},
|
||||
"section_5": {
|
||||
"earliest_per_type": [
|
||||
{
|
||||
"type": "aaronai_conversation",
|
||||
"earliest": "2026-04-26T17:43:28.056503",
|
||||
"latest": "2026-05-03T01:45:21.469613",
|
||||
"rows": 71
|
||||
},
|
||||
{
|
||||
"type": "claude_conversation",
|
||||
"earliest": "2026-02-28T20:33:36.146998Z",
|
||||
"latest": "2026-04-23T04:26:00.015419Z",
|
||||
"rows": 1074
|
||||
},
|
||||
{
|
||||
"type": "document",
|
||||
"earliest": "2026-04-30 16:42:55.360736+00",
|
||||
"latest": "2026-05-03 20:14:33.13663+00",
|
||||
"rows": 815
|
||||
}
|
||||
],
|
||||
"git_findings": [
|
||||
"037d7475738352dd13620486b5154d58fa6c037b 2026-04-28 00:15:46 +0000 chore: archive deprecated chromadb and migration scripts",
|
||||
"67766371789276ec4bcb8bac271b6eb9ddafa888 2026-04-27 05:16:37 +0000 Remove hardcoded PG password fallbacks \u2014 require PG_DSN env var in all scripts",
|
||||
"f78b83042bf2bb3d95c3604ee5d4431e76b103df 2026-04-26 21:16:04 +0000 Migrate to pgvector \u2014 remove ChromaDB from api.py, ingest scripts, dream.py",
|
||||
"8c8fba11b8d1b359b9b7722fc19b6ef562b812d8 2026-04-26 21:28:40 +0000 Add nightly conversation indexing \u2014 Aaron AI conversations into pgvector at 2:30AM",
|
||||
"f78b83042bf2bb3d95c3604ee5d4431e76b103df 2026-04-26 21:16:04 +0000 Migrate to pgvector \u2014 remove ChromaDB from api.py, ingest scripts, dream.py",
|
||||
"d2eed9890665a78a37fb5d336e8af75e7f2acb42 2026-04-26 20:19:49 +0000 Pre-pgvector migration checkpoint \u2014 upsert, allow_replace_deleted, maintenance timer"
|
||||
],
|
||||
"chromadb_candidates": [],
|
||||
"proposed_sentinel": "2026-04-26T00:00:00Z",
|
||||
"reasoning": "git f78b830 'Migrate to pgvector \u2014 remove ChromaDB from api.py, ingest scripts, dream.py' is dated 2026-04-26. The earliest type='document' row with a non-NULL created_at lands 2026-04-30 (the F11 canonical-encoding cutover). Rows with NULL created_at all predate F11 and most predate the pgvector cutover itself. 2026-04-26 is the date the ChromaDB->pgvector migration script was committed, so any row currently in the embeddings table with NULL created_at must have been ingested on or after that date (when the table came into existence in current form). It is the tightest defensible lower bound on 'the row entered pgvector before timestamps were tracked', so it is the right sentinel."
|
||||
},
|
||||
"section_6": [
|
||||
{
|
||||
"cohort": "A (type NULL, ca NULL)",
|
||||
"id": "f66c7390_6",
|
||||
"source": "Design Guide - FDM for Composite Tooling 2.0.pdf",
|
||||
"existing_type": null,
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2023-08-24T18:17:01Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "A (type NULL, ca NULL)",
|
||||
"id": "9cf798f8_151",
|
||||
"source": "Shop Class as Soulcraft An inquiry into the value of the -- Crawford, Matthew.pdf",
|
||||
"existing_type": null,
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-30T21:17:40.708026Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "A (type NULL, ca NULL)",
|
||||
"id": "fc378df0_329",
|
||||
"source": "ulysses.txt",
|
||||
"existing_type": null,
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2017-10-12T14:20:59Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "A (type NULL, ca NULL)",
|
||||
"id": "812bd5c6_0",
|
||||
"source": "Bennington College Cover Letter.pdf",
|
||||
"existing_type": null,
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2013-03-29T20:32:23Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "A (type NULL, ca NULL)",
|
||||
"id": "91ccefdd_185",
|
||||
"source": "Cognition in the Wild (A Bradford Book) -- Hutchins, Edwin.pdf",
|
||||
"existing_type": null,
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-25T17:21:35Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "A (type NULL, ca NULL)",
|
||||
"id": "48fa3d53_2",
|
||||
"source": "CMakeLists.txt",
|
||||
"existing_type": null,
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2016-12-21T10:37:05Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "A (type NULL, ca NULL)",
|
||||
"id": "49e3545d_9",
|
||||
"source": "RH50-TM-L1-EN-20140902.pdf",
|
||||
"existing_type": null,
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2014-09-02T18:44:08Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "A (type NULL, ca NULL)",
|
||||
"id": "a8366d89_144",
|
||||
"source": "Hackers and Painters_ Big Ideas from the Computer Age -- Graham, Paul.pdf",
|
||||
"existing_type": null,
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-24T22:25:03Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "A (type NULL, ca NULL)",
|
||||
"id": "3e3097f8_46",
|
||||
"source": "The Nature and Art of Workmanship -- David Pye.pdf",
|
||||
"existing_type": null,
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-24T22:24:03Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "A (type NULL, ca NULL)",
|
||||
"id": "87f9a5cf_269",
|
||||
"source": "Supersizing the Mind_ Embodiment, Action, and Cognitive -- Andy Clark.pdf",
|
||||
"existing_type": null,
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-25T17:14:25Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "cd3d1914_61",
|
||||
"source": "The world beyond your head _ on becoming an individual in an -- Crawford, Matthew B.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-27T16:04:25Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "592a1366_0",
|
||||
"source": "2026-04-29-synthesis.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-29T08:00:57.634567Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "cfb0a691_3",
|
||||
"source": "Consolidator-0.1-Specification.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-29T03:34:31Z",
|
||||
"inferred_ca_source": "watcher_state_unique"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "cd3d1914_57",
|
||||
"source": "The world beyond your head _ on becoming an individual in an -- Crawford, Matthew B.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-27T16:04:25Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "e65ef61c_8",
|
||||
"source": "BirdAI-Research-Context.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-29T15:57:07Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "4dce2922_3",
|
||||
"source": "cascade-optimization-protocol.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-28T05:46:24Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "077cc52d_1",
|
||||
"source": "graphiti-migration-plan.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-27T17:54:40Z",
|
||||
"inferred_ca_source": "watcher_state_collision_pick_latest_of_2"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "db356b14_70",
|
||||
"source": "Finite and infinite games -- James Carse.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-27T06:11:55Z",
|
||||
"inferred_ca_source": "watcher_state_collision_pick_latest_of_2"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "1f15bccf_38",
|
||||
"source": "BirdAI-Experiments-Log.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-05-01T16:40:02Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "db356b14_13",
|
||||
"source": "Finite and infinite games -- James Carse.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-27T06:11:55Z",
|
||||
"inferred_ca_source": "watcher_state_collision_pick_latest_of_2"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_68fd20c6-d838-832d-90f4-154f63281f49_30",
|
||||
"source": "ChatGPT: External review for tenure",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_691d6420-f544-8329-ae4b-f2b78da44c0e_7",
|
||||
"source": "ChatGPT: Website styling changes",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_67fc4254-ef50-8009-9e0f-81864cca7cec_1",
|
||||
"source": "ChatGPT: Job Application Review",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_68f3d936-d74c-8329-91df-fe838e292170_5",
|
||||
"source": "ChatGPT: SEC coaches with OSU ties",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_691d1b5b-bb4c-832b-8d2e-11a86a569fcc_4",
|
||||
"source": "ChatGPT: Hosting app platforms",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_bfa1cd2f-b8ab-4b11-b844-c47b2fa70612_1",
|
||||
"source": "ChatGPT: New chat",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_68ce1921-084c-8330-877c-78df1e03e54c_37",
|
||||
"source": "ChatGPT: Soul music playlist ideas",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_68fd20c6-d838-832d-90f4-154f63281f49_10",
|
||||
"source": "ChatGPT: External review for tenure",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_691d6420-f544-8329-ae4b-f2b78da44c0e_10",
|
||||
"source": "ChatGPT: Website styling changes",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_690286bd-0758-8332-8491-5d00c77f4696_1",
|
||||
"source": "ChatGPT: Airbrushing and finishing setup",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "6ef0e329_0",
|
||||
"source": "schematic-substrate-analysis.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-05-01 16:42:13.360795+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-05-01 16:42:13.360795+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "02db1224_208",
|
||||
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "ead32317_93",
|
||||
"source": "Richard Sennett - The Craftsman.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-04-30 22:23:34.012202+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-30 22:23:34.012202+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "6ef0e329_4",
|
||||
"source": "schematic-substrate-analysis.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-05-01 16:42:13.360795+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-05-01 16:42:13.360795+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "02db1224_175",
|
||||
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "02db1224_101",
|
||||
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "02db1224_268",
|
||||
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "6ef0e329_5",
|
||||
"source": "schematic-substrate-analysis.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-05-01 16:42:13.360795+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-05-01 16:42:13.360795+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "ead32317_132",
|
||||
"source": "Richard Sennett - The Craftsman.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-04-30 22:23:34.012202+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-30 22:23:34.012202+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "02db1224_86",
|
||||
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-claude (type='claude_conversation', ca set)",
|
||||
"id": "claude_dacf89e3-1ee7-400d-8461-ef5920c82fe3_96",
|
||||
"source": "Claude: University of Utah interview teaching example",
|
||||
"existing_type": "claude_conversation",
|
||||
"existing_ca": "2026-03-11T18:05:57.594832Z",
|
||||
"inferred_type": "claude_conversation",
|
||||
"inferred_ca": "2026-03-11T18:05:57.594832Z",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-claude (type='claude_conversation', ca set)",
|
||||
"id": "claude_c0baf4b0-a7bb-4664-ac7b-98d7b02f56a6_26",
|
||||
"source": "Claude: Weighing Utah versus Oklahoma",
|
||||
"existing_type": "claude_conversation",
|
||||
"existing_ca": "2026-04-01T19:08:26.722197Z",
|
||||
"inferred_type": "claude_conversation",
|
||||
"inferred_ca": "2026-04-01T19:08:26.722197Z",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-claude (type='claude_conversation', ca set)",
|
||||
"id": "claude_bbe0172d-3087-4238-a51c-7dca6c0b6f28_92",
|
||||
"source": "Claude: Setting up a custom OpenClaw instance",
|
||||
"existing_type": "claude_conversation",
|
||||
"existing_ca": "2026-04-23T04:26:00.015419Z",
|
||||
"inferred_type": "claude_conversation",
|
||||
"inferred_ca": "2026-04-23T04:26:00.015419Z",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-claude (type='claude_conversation', ca set)",
|
||||
"id": "claude_42dbddc5-12ba-4de7-a685-043473189da9_6",
|
||||
"source": "Claude: I filling out my annual report...",
|
||||
"existing_type": "claude_conversation",
|
||||
"existing_ca": "2026-03-24T14:34:47.870625Z",
|
||||
"inferred_type": "claude_conversation",
|
||||
"inferred_ca": "2026-03-24T14:34:47.870625Z",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-claude (type='claude_conversation', ca set)",
|
||||
"id": "claude_bbe0172d-3087-4238-a51c-7dca6c0b6f28_1344",
|
||||
"source": "Claude: Setting up a custom OpenClaw instance",
|
||||
"existing_type": "claude_conversation",
|
||||
"existing_ca": "2026-04-23T04:26:00.015419Z",
|
||||
"inferred_type": "claude_conversation",
|
||||
"inferred_ca": "2026-04-23T04:26:00.015419Z",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
|
||||
"id": "aaronai_conv_28ee8a447d3fc922_6",
|
||||
"source": "Aaron AI: I'm working on you",
|
||||
"existing_type": "aaronai_conversation",
|
||||
"existing_ca": "2026-04-26T17:43:28.056503",
|
||||
"inferred_type": "aaronai_conversation",
|
||||
"inferred_ca": "2026-04-26T17:43:28.056503",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
|
||||
"id": "aaronai_conv_7deef2e8001f0e45_20",
|
||||
"source": "Aaron AI: Who's covering for me on sabbatical?",
|
||||
"existing_type": "aaronai_conversation",
|
||||
"existing_ca": "2026-04-29T22:19:45.312349",
|
||||
"inferred_type": "aaronai_conversation",
|
||||
"inferred_ca": "2026-04-29T22:19:45.312349",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
|
||||
"id": "aaronai_conv_21cabf771708df70_42",
|
||||
"source": "Aaron AI: What should I be the most excited about right now?",
|
||||
"existing_type": "aaronai_conversation",
|
||||
"existing_ca": "2026-04-27T07:06:03.996026",
|
||||
"inferred_type": "aaronai_conversation",
|
||||
"inferred_ca": "2026-04-27T07:06:03.996026",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
|
||||
"id": "aaronai_conv_7deef2e8001f0e45_12",
|
||||
"source": "Aaron AI: Who's covering for me on sabbatical?",
|
||||
"existing_type": "aaronai_conversation",
|
||||
"existing_ca": "2026-04-29T22:19:45.312349",
|
||||
"inferred_type": "aaronai_conversation",
|
||||
"inferred_ca": "2026-04-29T22:19:45.312349",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
|
||||
"id": "aaronai_conv_ed40b4278a9c8110_4",
|
||||
"source": "Aaron AI: Let's say you're building an analog of the human brain, and ...",
|
||||
"existing_type": "aaronai_conversation",
|
||||
"existing_ca": "2026-05-03T01:45:21.469613",
|
||||
"inferred_type": "aaronai_conversation",
|
||||
"inferred_ca": "2026-05-03T01:45:21.469613",
|
||||
"inferred_ca_source": "preserved"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,987 @@
|
||||
{
|
||||
"generated_at": "2026-05-03T20:21:33.558462",
|
||||
"n_docs_with_frames": 668,
|
||||
"n_distinct_labels": 1374,
|
||||
"top_30_frames": [
|
||||
[
|
||||
"Education",
|
||||
238
|
||||
],
|
||||
[
|
||||
"Course",
|
||||
58
|
||||
],
|
||||
[
|
||||
"Programming",
|
||||
43
|
||||
],
|
||||
[
|
||||
"Design",
|
||||
32
|
||||
],
|
||||
[
|
||||
"Professional Experience",
|
||||
24
|
||||
],
|
||||
[
|
||||
"Employment",
|
||||
24
|
||||
],
|
||||
[
|
||||
"Research",
|
||||
23
|
||||
],
|
||||
[
|
||||
"3D Printing",
|
||||
22
|
||||
],
|
||||
[
|
||||
"Project",
|
||||
21
|
||||
],
|
||||
[
|
||||
"Grading",
|
||||
21
|
||||
],
|
||||
[
|
||||
"Art",
|
||||
21
|
||||
],
|
||||
[
|
||||
"Budget",
|
||||
21
|
||||
],
|
||||
[
|
||||
"Academic Integrity",
|
||||
20
|
||||
],
|
||||
[
|
||||
"Teaching",
|
||||
19
|
||||
],
|
||||
[
|
||||
"Technology",
|
||||
18
|
||||
],
|
||||
[
|
||||
"Attendance",
|
||||
17
|
||||
],
|
||||
[
|
||||
"Application",
|
||||
15
|
||||
],
|
||||
[
|
||||
"Accommodation",
|
||||
13
|
||||
],
|
||||
[
|
||||
"Manufacturing",
|
||||
13
|
||||
],
|
||||
[
|
||||
"Coursework",
|
||||
11
|
||||
],
|
||||
[
|
||||
"Recommendation",
|
||||
10
|
||||
],
|
||||
[
|
||||
"Manufacturing Process",
|
||||
10
|
||||
],
|
||||
[
|
||||
"Additive Manufacturing",
|
||||
10
|
||||
],
|
||||
[
|
||||
"Job Application",
|
||||
10
|
||||
],
|
||||
[
|
||||
"Exhibitions",
|
||||
10
|
||||
],
|
||||
[
|
||||
"Academic Administration",
|
||||
9
|
||||
],
|
||||
[
|
||||
"Communication",
|
||||
9
|
||||
],
|
||||
[
|
||||
"Course Design",
|
||||
9
|
||||
],
|
||||
[
|
||||
"Veteran and Military Services",
|
||||
9
|
||||
],
|
||||
[
|
||||
"Career",
|
||||
9
|
||||
]
|
||||
],
|
||||
"label_collisions": {
|
||||
"conversational": [
|
||||
[
|
||||
"Conversational",
|
||||
1
|
||||
],
|
||||
[
|
||||
"conversational",
|
||||
1
|
||||
]
|
||||
],
|
||||
"content": [
|
||||
[
|
||||
"Content",
|
||||
1
|
||||
],
|
||||
[
|
||||
"content",
|
||||
1
|
||||
]
|
||||
],
|
||||
"cascade": [
|
||||
[
|
||||
"Cascade",
|
||||
1
|
||||
],
|
||||
[
|
||||
"cascade",
|
||||
1
|
||||
]
|
||||
],
|
||||
"education": [
|
||||
[
|
||||
"Education",
|
||||
238
|
||||
],
|
||||
[
|
||||
"education",
|
||||
1
|
||||
]
|
||||
],
|
||||
"academic record": [
|
||||
[
|
||||
"Academic_Record",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Academic Record",
|
||||
1
|
||||
]
|
||||
],
|
||||
"independent study": [
|
||||
[
|
||||
"Independent Study",
|
||||
5
|
||||
],
|
||||
[
|
||||
"Independent_Study",
|
||||
2
|
||||
]
|
||||
],
|
||||
"project management": [
|
||||
[
|
||||
"Project Management",
|
||||
7
|
||||
],
|
||||
[
|
||||
"Project_Management",
|
||||
1
|
||||
]
|
||||
],
|
||||
"digital fabrication": [
|
||||
[
|
||||
"Digital Fabrication",
|
||||
6
|
||||
],
|
||||
[
|
||||
"digital_fabrication",
|
||||
1
|
||||
],
|
||||
[
|
||||
"digital fabrication",
|
||||
1
|
||||
]
|
||||
],
|
||||
"project proposal": [
|
||||
[
|
||||
"Project_Proposal",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Project Proposal",
|
||||
2
|
||||
]
|
||||
],
|
||||
"academic integrity": [
|
||||
[
|
||||
"Academic Integrity",
|
||||
20
|
||||
],
|
||||
[
|
||||
"Academic_Integrity",
|
||||
2
|
||||
]
|
||||
],
|
||||
"3d printing": [
|
||||
[
|
||||
"3D Printing",
|
||||
22
|
||||
],
|
||||
[
|
||||
"3D_Printing",
|
||||
7
|
||||
]
|
||||
],
|
||||
"technical skills": [
|
||||
[
|
||||
"Technical Skills",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Technical_Skills",
|
||||
1
|
||||
]
|
||||
],
|
||||
"course structure": [
|
||||
[
|
||||
"Course Structure",
|
||||
7
|
||||
],
|
||||
[
|
||||
"Course_Structure",
|
||||
1
|
||||
]
|
||||
],
|
||||
"course design": [
|
||||
[
|
||||
"Course Design",
|
||||
9
|
||||
],
|
||||
[
|
||||
"Course_Design",
|
||||
1
|
||||
]
|
||||
],
|
||||
"product design": [
|
||||
[
|
||||
"Product Design",
|
||||
6
|
||||
],
|
||||
[
|
||||
"Product_Design",
|
||||
1
|
||||
]
|
||||
],
|
||||
"professional experience": [
|
||||
[
|
||||
"Professional Experience",
|
||||
24
|
||||
],
|
||||
[
|
||||
"Professional_Experience",
|
||||
6
|
||||
]
|
||||
],
|
||||
"disability accommodations": [
|
||||
[
|
||||
"Disability Accommodations",
|
||||
4
|
||||
],
|
||||
[
|
||||
"Disability_Accommodations",
|
||||
1
|
||||
]
|
||||
],
|
||||
"material science": [
|
||||
[
|
||||
"Material_Science",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Material Science",
|
||||
4
|
||||
]
|
||||
],
|
||||
"computational design": [
|
||||
[
|
||||
"Computational Design",
|
||||
7
|
||||
],
|
||||
[
|
||||
"Computational_Design",
|
||||
1
|
||||
]
|
||||
],
|
||||
"computer services policy": [
|
||||
[
|
||||
"Computer Services Policy",
|
||||
6
|
||||
],
|
||||
[
|
||||
"Computer_Services_Policy",
|
||||
1
|
||||
]
|
||||
],
|
||||
"work experience": [
|
||||
[
|
||||
"Work_Experience",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Work Experience",
|
||||
3
|
||||
]
|
||||
],
|
||||
"academic program": [
|
||||
[
|
||||
"Academic Program",
|
||||
7
|
||||
],
|
||||
[
|
||||
"Academic_Program",
|
||||
1
|
||||
]
|
||||
],
|
||||
"project-based learning": [
|
||||
[
|
||||
"Project-Based Learning",
|
||||
5
|
||||
],
|
||||
[
|
||||
"Project-Based_Learning",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Project-based Learning",
|
||||
2
|
||||
]
|
||||
],
|
||||
"art and design": [
|
||||
[
|
||||
"Art and Design",
|
||||
6
|
||||
],
|
||||
[
|
||||
"Art_and_Design",
|
||||
1
|
||||
]
|
||||
],
|
||||
"fdm technology": [
|
||||
[
|
||||
"FDM_Technology",
|
||||
2
|
||||
],
|
||||
[
|
||||
"FDM Technology",
|
||||
1
|
||||
]
|
||||
],
|
||||
"material selection": [
|
||||
[
|
||||
"Material_Selection",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Material Selection",
|
||||
1
|
||||
]
|
||||
],
|
||||
"product development": [
|
||||
[
|
||||
"Product Development",
|
||||
6
|
||||
],
|
||||
[
|
||||
"Product_Development",
|
||||
2
|
||||
]
|
||||
],
|
||||
"market research": [
|
||||
[
|
||||
"Market_Research",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Market Research",
|
||||
2
|
||||
]
|
||||
],
|
||||
"computer services": [
|
||||
[
|
||||
"Computer Services",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Computer_Services",
|
||||
1
|
||||
]
|
||||
],
|
||||
"student evaluation of instruction": [
|
||||
[
|
||||
"Student Evaluation of Instruction",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Student_Evaluation_of_Instruction",
|
||||
1
|
||||
]
|
||||
],
|
||||
"course management": [
|
||||
[
|
||||
"Course_Management",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Course Management",
|
||||
1
|
||||
]
|
||||
],
|
||||
"grade policy": [
|
||||
[
|
||||
"Grade_Policy",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Grade Policy",
|
||||
1
|
||||
]
|
||||
],
|
||||
"academic transcript": [
|
||||
[
|
||||
"Academic_Transcript",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Academic Transcript",
|
||||
1
|
||||
]
|
||||
],
|
||||
"evaluation criteria": [
|
||||
[
|
||||
"Evaluation Criteria",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Evaluation_Criteria",
|
||||
1
|
||||
]
|
||||
],
|
||||
"computer science": [
|
||||
[
|
||||
"Computer Science",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Computer_Science",
|
||||
1
|
||||
]
|
||||
],
|
||||
"electrical circuit": [
|
||||
[
|
||||
"Electrical Circuit",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Electrical_Circuit",
|
||||
1
|
||||
]
|
||||
],
|
||||
"digital logic": [
|
||||
[
|
||||
"Digital Logic",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Digital_Logic",
|
||||
1
|
||||
]
|
||||
],
|
||||
"course description": [
|
||||
[
|
||||
"Course Description",
|
||||
3
|
||||
],
|
||||
[
|
||||
"Course_Description",
|
||||
1
|
||||
]
|
||||
],
|
||||
"organizational structure": [
|
||||
[
|
||||
"Organizational_Structure",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Organizational Structure",
|
||||
1
|
||||
]
|
||||
],
|
||||
"digital design": [
|
||||
[
|
||||
"Digital_Design",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Digital Design",
|
||||
4
|
||||
]
|
||||
],
|
||||
"contact information": [
|
||||
[
|
||||
"Contact Information",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Contact_Information",
|
||||
1
|
||||
]
|
||||
],
|
||||
"professional career": [
|
||||
[
|
||||
"Professional_Career",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Professional Career",
|
||||
1
|
||||
]
|
||||
],
|
||||
"personal projects": [
|
||||
[
|
||||
"Personal_Projects",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Personal Projects",
|
||||
2
|
||||
]
|
||||
],
|
||||
"ai development": [
|
||||
[
|
||||
"AI_Development",
|
||||
1
|
||||
],
|
||||
[
|
||||
"AI Development",
|
||||
1
|
||||
]
|
||||
],
|
||||
"university service": [
|
||||
[
|
||||
"University Service",
|
||||
2
|
||||
],
|
||||
[
|
||||
"University_Service",
|
||||
1
|
||||
]
|
||||
],
|
||||
"professional exhibitions and publications": [
|
||||
[
|
||||
"Professional Exhibitions and Publications",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Professional_Exhibitions_and_Publications",
|
||||
1
|
||||
]
|
||||
],
|
||||
"selected external consulting and design work": [
|
||||
[
|
||||
"Selected External Consulting and Design Work",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Selected_External_Consulting_and_Design_Work",
|
||||
2
|
||||
]
|
||||
],
|
||||
"academic career": [
|
||||
[
|
||||
"Academic_Career",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Academic Career",
|
||||
2
|
||||
]
|
||||
],
|
||||
"technology integration": [
|
||||
[
|
||||
"Technology Integration",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Technology_Integration",
|
||||
1
|
||||
]
|
||||
],
|
||||
"artistic practice": [
|
||||
[
|
||||
"Artistic_Practice",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Artistic Practice",
|
||||
1
|
||||
]
|
||||
],
|
||||
"multi-material 3d printing": [
|
||||
[
|
||||
"Multi-Material 3D Printing",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Multi-material 3D Printing",
|
||||
1
|
||||
]
|
||||
],
|
||||
"community engagement": [
|
||||
[
|
||||
"Community Engagement",
|
||||
3
|
||||
],
|
||||
[
|
||||
"Community_Engagement",
|
||||
1
|
||||
]
|
||||
],
|
||||
"digitaldesignandfabrication": [
|
||||
[
|
||||
"DigitalDesignAndFabrication",
|
||||
1
|
||||
],
|
||||
[
|
||||
"DigitalDesignandFabrication",
|
||||
1
|
||||
]
|
||||
],
|
||||
"professional background": [
|
||||
[
|
||||
"Professional Background",
|
||||
3
|
||||
],
|
||||
[
|
||||
"Professional_Background",
|
||||
1
|
||||
]
|
||||
]
|
||||
},
|
||||
"per_doc_frame_count": {
|
||||
"3": 282,
|
||||
"5": 67,
|
||||
"4": 195,
|
||||
"2": 57,
|
||||
"7": 13,
|
||||
"11": 5,
|
||||
"13": 2,
|
||||
"15": 1,
|
||||
"12": 4,
|
||||
"6": 21,
|
||||
"8": 8,
|
||||
"10": 4,
|
||||
"9": 6,
|
||||
"30": 1,
|
||||
"14": 1,
|
||||
"18": 1
|
||||
},
|
||||
"top_30_pairs": [
|
||||
{
|
||||
"a": "Course",
|
||||
"b": "Education",
|
||||
"count": 46
|
||||
},
|
||||
{
|
||||
"a": "Education",
|
||||
"b": "Project",
|
||||
"count": 20
|
||||
},
|
||||
{
|
||||
"a": "Design",
|
||||
"b": "Education",
|
||||
"count": 20
|
||||
},
|
||||
{
|
||||
"a": "Education",
|
||||
"b": "Professional Experience",
|
||||
"count": 20
|
||||
},
|
||||
{
|
||||
"a": "Education",
|
||||
"b": "Employment",
|
||||
"count": 20
|
||||
},
|
||||
{
|
||||
"a": "Education",
|
||||
"b": "Technology",
|
||||
"count": 18
|
||||
},
|
||||
{
|
||||
"a": "Education",
|
||||
"b": "Grading",
|
||||
"count": 17
|
||||
},
|
||||
{
|
||||
"a": "Education",
|
||||
"b": "Research",
|
||||
"count": 15
|
||||
},
|
||||
{
|
||||
"a": "Art",
|
||||
"b": "Education",
|
||||
"count": 15
|
||||
},
|
||||
{
|
||||
"a": "Attendance",
|
||||
"b": "Grading",
|
||||
"count": 14
|
||||
},
|
||||
{
|
||||
"a": "Course",
|
||||
"b": "Grading",
|
||||
"count": 13
|
||||
},
|
||||
{
|
||||
"a": "Academic Integrity",
|
||||
"b": "Education",
|
||||
"count": 11
|
||||
},
|
||||
{
|
||||
"a": "Attendance",
|
||||
"b": "Education",
|
||||
"count": 11
|
||||
},
|
||||
{
|
||||
"a": "Attendance",
|
||||
"b": "Course",
|
||||
"count": 11
|
||||
},
|
||||
{
|
||||
"a": "Application",
|
||||
"b": "Employment",
|
||||
"count": 11
|
||||
},
|
||||
{
|
||||
"a": "Coursework",
|
||||
"b": "Education",
|
||||
"count": 10
|
||||
},
|
||||
{
|
||||
"a": "Course",
|
||||
"b": "Design",
|
||||
"count": 10
|
||||
},
|
||||
{
|
||||
"a": "Course",
|
||||
"b": "Programming",
|
||||
"count": 10
|
||||
},
|
||||
{
|
||||
"a": "Application",
|
||||
"b": "Education",
|
||||
"count": 10
|
||||
},
|
||||
{
|
||||
"a": "Budget",
|
||||
"b": "Education",
|
||||
"count": 10
|
||||
},
|
||||
{
|
||||
"a": "Academic Integrity",
|
||||
"b": "Accommodation",
|
||||
"count": 9
|
||||
},
|
||||
{
|
||||
"a": "Education",
|
||||
"b": "Teaching",
|
||||
"count": 9
|
||||
},
|
||||
{
|
||||
"a": "Education",
|
||||
"b": "Programming",
|
||||
"count": 9
|
||||
},
|
||||
{
|
||||
"a": "Academic Integrity",
|
||||
"b": "Attendance",
|
||||
"count": 9
|
||||
},
|
||||
{
|
||||
"a": "Course",
|
||||
"b": "Project",
|
||||
"count": 8
|
||||
},
|
||||
{
|
||||
"a": "Research",
|
||||
"b": "Teaching",
|
||||
"count": 8
|
||||
},
|
||||
{
|
||||
"a": "Grading",
|
||||
"b": "Project",
|
||||
"count": 7
|
||||
},
|
||||
{
|
||||
"a": "Art",
|
||||
"b": "Technology",
|
||||
"count": 7
|
||||
},
|
||||
{
|
||||
"a": "Academic Integrity",
|
||||
"b": "Course",
|
||||
"count": 7
|
||||
},
|
||||
{
|
||||
"a": "Accommodation",
|
||||
"b": "Course",
|
||||
"count": 7
|
||||
}
|
||||
],
|
||||
"folder_crosstab": {
|
||||
"Education": {
|
||||
"pdf": 116,
|
||||
"docx": 119,
|
||||
"pptx": 3
|
||||
},
|
||||
"Course": {
|
||||
"pdf": 29,
|
||||
"docx": 29
|
||||
},
|
||||
"Programming": {
|
||||
"pptx": 15,
|
||||
"docx": 10,
|
||||
"pdf": 12,
|
||||
"txt": 6
|
||||
},
|
||||
"Design": {
|
||||
"pdf": 13,
|
||||
"docx": 16,
|
||||
"pptx": 3
|
||||
},
|
||||
"Professional Experience": {
|
||||
"docx": 13,
|
||||
"pdf": 11
|
||||
},
|
||||
"Employment": {
|
||||
"pdf": 15,
|
||||
"docx": 9
|
||||
},
|
||||
"Research": {
|
||||
"pdf": 9,
|
||||
"docx": 13,
|
||||
"markdown": 1
|
||||
},
|
||||
"3D Printing": {
|
||||
"docx": 3,
|
||||
"pdf": 11,
|
||||
"pptx": 8
|
||||
},
|
||||
"Project": {
|
||||
"pdf": 8,
|
||||
"docx": 12,
|
||||
"markdown": 1
|
||||
},
|
||||
"Grading": {
|
||||
"pdf": 10,
|
||||
"docx": 11
|
||||
},
|
||||
"Art": {
|
||||
"docx": 11,
|
||||
"pdf": 9,
|
||||
"pptx": 1
|
||||
},
|
||||
"Budget": {
|
||||
"docx": 6,
|
||||
"pdf": 15
|
||||
},
|
||||
"Academic Integrity": {
|
||||
"docx": 17,
|
||||
"pdf": 3
|
||||
},
|
||||
"Teaching": {
|
||||
"pdf": 9,
|
||||
"docx": 10
|
||||
},
|
||||
"Technology": {
|
||||
"docx": 15,
|
||||
"pdf": 3
|
||||
},
|
||||
"Attendance": {
|
||||
"docx": 11,
|
||||
"pdf": 6
|
||||
},
|
||||
"Application": {
|
||||
"pdf": 13,
|
||||
"docx": 2
|
||||
},
|
||||
"Accommodation": {
|
||||
"docx": 11,
|
||||
"pdf": 2
|
||||
},
|
||||
"Manufacturing": {
|
||||
"docx": 6,
|
||||
"pptx": 4,
|
||||
"pdf": 3
|
||||
},
|
||||
"Coursework": {
|
||||
"pdf": 8,
|
||||
"docx": 3
|
||||
}
|
||||
},
|
||||
"bin_totals": {
|
||||
"markdown": 64,
|
||||
"pdf": 286,
|
||||
"pptx": 70,
|
||||
"txt": 28,
|
||||
"docx": 217,
|
||||
"dream_output": 3
|
||||
},
|
||||
"worker_versions": {
|
||||
"2.0": 3,
|
||||
"2.1": 665
|
||||
},
|
||||
"data_gap": {
|
||||
"count": 339,
|
||||
"by_type_bin": {
|
||||
"pdf": 110,
|
||||
"voice_note": 14,
|
||||
"docx": 110,
|
||||
"dream_output": 39,
|
||||
"pptx": 31,
|
||||
"txt": 28,
|
||||
"markdown": 7
|
||||
},
|
||||
"char_length": {
|
||||
"min": 6,
|
||||
"max": 1998,
|
||||
"median": 1077
|
||||
},
|
||||
"sample_sources": [
|
||||
"Thesis Paper Guidlines.pdf",
|
||||
"2026-04-30-17-06-voice.md",
|
||||
"2026-04-30-15-59-voice.md",
|
||||
"2026-04-30-16-53-voice.md",
|
||||
"2026-04-30-16-23-voice.md",
|
||||
"2026-04-29-17-52-voice.md",
|
||||
"2026-04-30-16-59-voice.md",
|
||||
"Outline for 3D Printed Materials for Foundry Casting.docx",
|
||||
"2026-04-26-22-52-voice.md",
|
||||
"2026-04-30-synthesis.md"
|
||||
]
|
||||
},
|
||||
"corpus_coverage": {
|
||||
"total_distinct_sources_in_embeddings": 1255,
|
||||
"conversations_no_frames_by_design": 198,
|
||||
"files_with_frames": 704,
|
||||
"files_short_no_frames": 339,
|
||||
"files_stage2_failed": 12,
|
||||
"frame_coverage_pct": 56.1
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,4 @@
|
||||
# Local backups created by apply.sh — environment state, not source.
|
||||
# Keeping these out of version control prevents repo bloat and avoids
|
||||
# checking in graphiti-core's Apache-2.0 source under our repo's tree.
|
||||
backups/
|
||||
@@ -0,0 +1,58 @@
|
||||
# graphiti-core Patches — FalkorDB Vector Index Support

Vendored patches against graphiti-core 0.29.0 that add native FalkorDB
vector index support. Three files are modified: `graphiti_core/graph_queries.py`
and two FalkorDB driver files under `graphiti_core/driver/`.
No changes to Neo4j or Kuzu code paths.

## Why this exists

graphiti-core's FalkorDB driver uses interpreted Cypher cosine math
(`vec.cosineDistance(...)`) for similarity search. Each query becomes a
full scan over Entity nodes, RELATES_TO edges, or Community nodes. At
roughly 4,000 entities and above, single-episode ingest's
resolve-against-existing-graph step takes 8+ minutes and bulk ingest
hangs FalkorDB. FalkorDB itself supports `db.idx.vector.queryNodes` and
`db.idx.vector.queryRelationships` procedures backed by HNSW indexes;
graphiti-core's driver doesn't use them.
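
To make the cost concrete, here is a minimal sketch of what a full-scan similarity search amounts to: every stored embedding is compared against the query vector, so the work grows linearly with the number of entities. The function names and toy data are illustrative assumptions and are not part of graphiti-core or FalkorDB.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def full_scan_search(
    embeddings: dict[str, list[float]],
    query: list[float],
    limit: int,
) -> tuple[list[tuple[str, float]], int]:
    """Score every stored vector against the query.

    Returns the top `limit` hits and the number of comparisons performed,
    which always equals the number of stored vectors.
    """
    scored = [(name, cosine_similarity(vec, query)) for name, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit], len(scored)


# Hypothetical toy data standing in for Entity.name_embedding values.
toy_embeddings = {
    "alpha": [1.0, 0.0],
    "beta": [0.0, 1.0],
    "gamma": [1.0, 1.0],
}
top_hits, comparisons = full_scan_search(toy_embeddings, [1.0, 0.1], limit=2)
```

An index-backed search is meant to avoid touching every vector on every query, which is the motivation for the patches described next.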
|
||||
|
||||
These patches:

1. Add `get_vector_indices()` to `graph_queries.py`, returning CREATE
   VECTOR INDEX statements for FalkorDB on Entity.name_embedding,
   RELATES_TO.fact_embedding, and Community.name_embedding.
2. Extend `falkordb_driver.py:build_indices_and_constraints()` to create
   the vector indexes alongside range and fulltext indexes.
3. Rewrite the three vector-similarity call sites in
   `falkordb/operations/search_ops.py` to use
   `db.idx.vector.queryNodes` and `db.idx.vector.queryRelationships`
   instead of full-scan cosine math. Each query over-fetches by a
   configurable multiplier so that enough candidates remain after the
   WHERE filters reject some of them.
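
The over-fetch in step 3 can be sketched as follows. The index call shape (`db.idx.vector.queryNodes` with `$candidate_k` and `vecf32(...)`) mirrors the patched `search_ops.py`. The multiplier value of 4, the simplified `RETURN` clause, and the helper names are assumptions for illustration only; the real constant is `VECTOR_INDEX_CANDIDATE_MULTIPLIER` in `graph_queries.py`.

```python
# Assumed value for illustration; the patch reads the real one from graph_queries.py.
CANDIDATE_MULTIPLIER = 4


def build_index_query(label: str, attribute: str, filters: list[str]) -> str:
    """Compose an index-backed node similarity query with filters applied after the index call."""
    where = " AND ".join([*filters, "score > $min_score"])
    return (
        f"CALL db.idx.vector.queryNodes('{label}', '{attribute}', "
        "$candidate_k, vecf32($search_vector)) YIELD node, score "
        "WITH node AS n, score "
        f"WHERE {where} "
        # Simplified return clause; the patch returns the full entity projection.
        "RETURN n.uuid AS uuid, score ORDER BY score DESC LIMIT $limit"
    )


def build_params(search_vector: list[float], limit: int, min_score: float) -> dict:
    """Request more candidates than `limit` so filtering is less likely to starve the result."""
    return {
        "search_vector": search_vector,
        "limit": limit,
        "min_score": min_score,
        "candidate_k": limit * CANDIDATE_MULTIPLIER,
    }


query = build_index_query("Entity", "name_embedding", ["n.group_id IN $group_ids"])
params = build_params([0.1, 0.2, 0.3], limit=10, min_score=0.6)
```

The index returns its top `candidate_k` matches before any `WHERE` clause runs, so asking for exactly `limit` rows could leave fewer than `limit` results once filters such as `group_id` are applied.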
|
||||
|
||||
## Files

| Patched file | Change |
|---|---|
| `graphiti_core/graph_queries.py` | Adds `get_vector_indices()` |
| `graphiti_core/driver/falkordb_driver.py` | Extends `build_indices_and_constraints` |
| `graphiti_core/driver/falkordb/operations/search_ops.py` | Three query rewrites |
|
||||
|
||||
## How to apply

`./apply.sh` backs up the originals into `./backups/<timestamp>/`
and copies the patched files over.

## How to revert

Copy the timestamped backup back over the venv:

    cp backups/<ts>/graphiti_core/graph_queries.py /home/aaron/aaronai/venv/lib/python3.12/site-packages/graphiti_core/graph_queries.py
    # ...etc
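
Because the backup directory mirrors the venv layout, the revert can also be scripted as a tree copy. This is a minimal sketch, assuming the `backups/<timestamp>/graphiti_core/...` layout that `apply.sh` creates; the `revert_from_backup` helper is hypothetical and is not shipped with these patches. The demo at the bottom runs against throwaway directories only.

```python
import shutil
import tempfile
from pathlib import Path


def revert_from_backup(backup_root: Path, site_packages: Path) -> list[Path]:
    """Copy every file under backup_root to the same relative path in site_packages."""
    restored = []
    for src in sorted(backup_root.rglob("*")):
        if not src.is_file():
            continue
        dest = site_packages / src.relative_to(backup_root)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        restored.append(dest)
    return restored


# Self-contained demo: no real venv is touched.
with tempfile.TemporaryDirectory() as tmp:
    backup = Path(tmp) / "backups" / "20260502-120000"
    venv = Path(tmp) / "site-packages"

    original = backup / "graphiti_core" / "graph_queries.py"
    original.parent.mkdir(parents=True)
    original.write_text("# original\n")

    patched = venv / "graphiti_core" / "graph_queries.py"
    patched.parent.mkdir(parents=True)
    patched.write_text("# patched\n")

    restored = revert_from_backup(backup, venv)
    restored_text = patched.read_text()
```

After a real revert, the sidecar service would still need a restart to load the restored files.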
|
||||
|
||||
## Upstream candidate

This is a documented gap: issue #1263 references it indirectly via the
vector store overlay RFC. The maintainers' attention is on a
Milvus/external vector DB overlay; this patch is the FalkorDB-native
alternative for users who don't want a separate vector DB. Consider
opening a PR after empirical validation in production.
|
||||
Executable
+77
@@ -0,0 +1,77 @@
|
||||
#!/usr/bin/env bash
# apply.sh — Apply the BirdAI vendored graphiti-core patches.
#
# Backs up the original venv files into ./backups/<timestamp>/ before
# overwriting. The backup directory layout mirrors the venv layout so a
# revert is just a tree copy back.
#
# Usage: ./apply.sh

set -euo pipefail

PATCH_DIR="$(cd "$(dirname "$0")" && pwd)"
VENV_BASE="/home/aaron/aaronai/venv/lib/python3.12/site-packages"
TIMESTAMP="$(date +%Y%m%d-%H%M%S)"
BACKUP_DIR="$PATCH_DIR/backups/$TIMESTAMP"

# Files to patch — paths relative to graphiti_core/.
FILES=(
    "graph_queries.py"
    "driver/falkordb_driver.py"
    "driver/falkordb/operations/search_ops.py"
)

echo "graphiti-core vendored patch apply — BirdAI"
echo "Patch directory: $PATCH_DIR"
echo "Venv target: $VENV_BASE/graphiti_core/"
echo "Backup to: $BACKUP_DIR"
echo

# Pre-flight: confirm all source patch files exist.
for rel in "${FILES[@]}"; do
    if [ ! -f "$PATCH_DIR/graphiti_core/$rel" ]; then
        echo "ERROR: missing patch file: $PATCH_DIR/graphiti_core/$rel" >&2
        exit 1
    fi
done

# Pre-flight: confirm all target venv files exist.
for rel in "${FILES[@]}"; do
    if [ ! -f "$VENV_BASE/graphiti_core/$rel" ]; then
        echo "ERROR: missing venv file: $VENV_BASE/graphiti_core/$rel" >&2
        echo " graphiti-core may not be installed, or version differs from 0.29.0." >&2
        exit 1
    fi
done

# Backup originals.
echo "[1/3] Backing up originals..."
for rel in "${FILES[@]}"; do
    backup_path="$BACKUP_DIR/graphiti_core/$rel"
    mkdir -p "$(dirname "$backup_path")"
    cp "$VENV_BASE/graphiti_core/$rel" "$backup_path"
    echo " backed up: $rel"
done
echo

# Apply patches by copying.
echo "[2/3] Applying patches..."
for rel in "${FILES[@]}"; do
    cp "$PATCH_DIR/graphiti_core/$rel" "$VENV_BASE/graphiti_core/$rel"
    echo " patched: $rel"
done
echo

# Sanity check: confirm patched files have the marker.
echo "[3/3] Verifying patched files..."
for rel in "${FILES[@]}"; do
    if grep -q "PATCHED 2026-05-02" "$VENV_BASE/graphiti_core/$rel"; then
        echo " OK: $rel contains patch marker"
    else
        echo " WARNING: $rel missing patch marker (may be expected for graph_queries.py — its docstring uses the marker only in the module header)"
    fi
done
echo
echo "Done. Backup: $BACKUP_DIR"
echo "Restart the sidecar to pick up changes:"
echo " sudo systemctl restart aaronai-graphiti.service"
|
||||
@@ -0,0 +1,904 @@
|
||||
"""
|
||||
Copyright 2024, Zep Software, Inc.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License");
|
||||
you may not use this file except in compliance with the License.
|
||||
You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software
|
||||
distributed under the License is distributed on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
See the License for the specific language governing permissions and
|
||||
limitations under the License.
|
||||
"""
|
||||
|
||||
import logging
|
||||
from typing import Any
|
||||
|
||||
from graphiti_core.driver.driver import GraphProvider
|
||||
from graphiti_core.driver.falkordb import STOPWORDS
|
||||
from graphiti_core.driver.operations.search_ops import SearchOperations
|
||||
from graphiti_core.driver.query_executor import QueryExecutor
|
||||
from graphiti_core.driver.record_parsers import (
|
||||
community_node_from_record,
|
||||
entity_edge_from_record,
|
||||
entity_node_from_record,
|
||||
episodic_node_from_record,
|
||||
)
|
||||
from graphiti_core.edges import EntityEdge
|
||||
from graphiti_core.graph_queries import (
|
||||
get_nodes_query,
|
||||
get_relationships_query,
|
||||
get_vector_cosine_func_query,
|
||||
)
|
||||
from graphiti_core.models.edges.edge_db_queries import get_entity_edge_return_query
|
||||
from graphiti_core.models.nodes.node_db_queries import (
|
||||
COMMUNITY_NODE_RETURN,
|
||||
EPISODIC_NODE_RETURN,
|
||||
get_entity_node_return_query,
|
||||
)
|
||||
from graphiti_core.nodes import CommunityNode, EntityNode, EpisodicNode
|
||||
from graphiti_core.search.search_filters import (
|
||||
SearchFilters,
|
||||
edge_search_filter_query_constructor,
|
||||
node_search_filter_query_constructor,
|
||||
)
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
MAX_QUERY_LENGTH = 128
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Vector index dispatcher (PATCHED 2026-05-02, BirdAI vendored patch).
|
||||
#
|
||||
# graphiti-core's FalkorDB driver historically composed similarity queries
|
||||
# using `vec.cosineDistance(...)` in interpreted Cypher, which produces a
|
||||
# full-table scan for every search. FalkorDB supports native vector indexes
|
||||
# via `db.idx.vector.queryNodes` and `db.idx.vector.queryRelationships`;
|
||||
# this dispatcher uses them when present and falls back to the cosine math
|
||||
# otherwise.
|
||||
#
|
||||
# Index existence is checked once per (label, attribute, entity_type) and
|
||||
# cached at module scope. The cache should be invalidated whenever
|
||||
# `build_indices_and_constraints` runs (since indexes may have been created
|
||||
# or dropped). FalkorDriver.build_indices_and_constraints is patched to
|
||||
# call `_invalidate_falkordb_vector_index_cache()` after building.
|
||||
#
|
||||
# Over-fetch factor (VECTOR_INDEX_CANDIDATE_MULTIPLIER from graph_queries)
|
||||
# preserves recall when WHERE filters reject some of the top-k candidates.
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
# get_vector_cosine_func_query is already imported at the top of the module.
from graphiti_core.graph_queries import VECTOR_INDEX_CANDIDATE_MULTIPLIER
|
||||
|
||||
# Cache: key = (label, attribute, entity_type), value = bool
|
||||
# entity_type is 'NODE' or 'RELATIONSHIP'.
|
||||
_FALKORDB_VECTOR_INDEX_CACHE: dict[tuple[str, str, str], bool] = {}
|
||||
|
||||
|
||||
def _invalidate_falkordb_vector_index_cache() -> None:
    """Clear the vector-index existence cache. Call after build_indices_and_constraints."""
    _FALKORDB_VECTOR_INDEX_CACHE.clear()
|
||||
|
||||
|
||||
async def _falkordb_vector_index_exists(
    executor: QueryExecutor,
    label: str,
    attribute: str,
    entity_type: str,
) -> bool:
    """Check whether a FalkorDB vector index exists for the given target.

    entity_type is 'NODE' for node-label indexes, 'RELATIONSHIP' for edge-type indexes.
    Result is cached at module scope; call _invalidate_falkordb_vector_index_cache()
    after building or dropping indexes.
    """
    key = (label, attribute, entity_type)
    if key in _FALKORDB_VECTOR_INDEX_CACHE:
        return _FALKORDB_VECTOR_INDEX_CACHE[key]

    try:
        records, _, _ = await executor.execute_query(
            "CALL db.indexes() YIELD label, properties, types, entitytype "
            "RETURN label, properties, types, entitytype"
        )
    except Exception as e:
        # If we cannot enumerate indexes, fall back to "no index" rather than
        # propagating the error. The fallback cosine-math path is correct,
        # just slower.
        logger.warning(f"FalkorDB vector index probe failed; assuming none exist: {e}")
        _FALKORDB_VECTOR_INDEX_CACHE[key] = False
        return False

    found = False
    for r in records:
        # Records come back as dict-like rows keyed by column name (not
        # tuples). Access by string keys matching the YIELD clause above.
        rec_label = r.get('label') if hasattr(r, 'get') else r['label']
        rec_props = r.get('properties') if hasattr(r, 'get') else r['properties']
        rec_types = r.get('types') if hasattr(r, 'get') else r['types']
        rec_entitytype = r.get('entitytype') if hasattr(r, 'get') else r['entitytype']
        if rec_props is None:
            rec_props = []
        if rec_types is None:
            rec_types = {}

        if rec_label != label:
            continue
        if rec_entitytype is not None and rec_entitytype != entity_type:
            continue
        if attribute not in rec_props:
            continue

        # rec_types is a dict like {attribute: ['VECTOR', ...], ...} or sometimes
        # a flat list — handle both shapes.
        if isinstance(rec_types, dict):
            attr_types = rec_types.get(attribute, [])
        else:
            attr_types = rec_types
        if 'VECTOR' in attr_types:
            found = True
            break

    _FALKORDB_VECTOR_INDEX_CACHE[key] = found
    return found
|
||||
|
||||
|
||||
def _falkordb_vector_node_search_cypher(
    label: str,
    embedding_attr: str,
    search_vector_param: str,
    use_index: bool,
) -> tuple[str, str]:
    """Build the cypher prefix and node-binding for a node-vector search.

    Returns (prefix, node_var) where:
    - prefix is the Cypher fragment that binds the node variable and a
      `score` variable. With index, it's a CALL ... YIELD; without, it's
      a MATCH plus WITH cosine math.
    - node_var is the variable name the caller's downstream Cypher should
      reference (always 'n' here for parity with the existing code).

    The caller appends WHERE filters and RETURN/ORDER BY/LIMIT as usual.
    The over-fetch parameter `$candidate_k` must be passed by the caller
    when use_index is True.
    """
    if use_index:
        return (
            f"CALL db.idx.vector.queryNodes("
            f"'{label}', '{embedding_attr}', $candidate_k, vecf32({search_vector_param})"
            f") YIELD node, score "
            f"WITH node AS n, score "
        ), "n"
    # Fallback: original cosine math path
    cosine = get_vector_cosine_func_query(
        f"n.{embedding_attr}", search_vector_param, GraphProvider.FALKORDB
    )
    return (
        f"MATCH (n:{label}) "
        f"WITH n, {cosine} AS score "
    ), "n"
|
||||
|
||||
|
||||
def _falkordb_vector_edge_search_cypher(
    relationship_type: str,
    embedding_attr: str,
    search_vector_param: str,
    use_index: bool,
) -> tuple[str, str]:
    """Build the cypher prefix and edge-binding for an edge-vector search.

    Returns (prefix, edge_var). With the index, the procedure binds the
    relationship variable; we then MATCH source and target via the existing
    edge to recover (n)-[e]->(m). Without the index, it's the original
    MATCH-and-cosine path.

    Variable name is 'e' for parity with existing code; source/target are
    'n' and 'm' respectively, also for parity.
    """
    if use_index:
        return (
            f"CALL db.idx.vector.queryRelationships("
            f"'{relationship_type}', '{embedding_attr}', $candidate_k, vecf32({search_vector_param})"
            f") YIELD relationship, score "
            f"MATCH (n:Entity)-[e:{relationship_type}]->(m:Entity) "
            f"WHERE e = relationship "
            f"WITH DISTINCT e, n, m, score "
        ), "e"
    # Fallback
    cosine = get_vector_cosine_func_query(
        f"e.{embedding_attr}", search_vector_param, GraphProvider.FALKORDB
    )
    return (
        f"MATCH (n:Entity)-[e:{relationship_type}]->(m:Entity) "
        f"WITH DISTINCT e, n, m, {cosine} AS score "
    ), "e"
|
||||
|
||||
|
||||
|
||||
# FalkorDB separator characters that break text into tokens
|
||||
_SEPARATOR_MAP = str.maketrans(
|
||||
{
|
||||
',': ' ',
|
||||
'.': ' ',
|
||||
'<': ' ',
|
||||
'>': ' ',
|
||||
'{': ' ',
|
||||
'}': ' ',
|
||||
'[': ' ',
|
||||
']': ' ',
|
||||
'"': ' ',
|
||||
"'": ' ',
|
||||
':': ' ',
|
||||
';': ' ',
|
||||
'!': ' ',
|
||||
'@': ' ',
|
||||
'#': ' ',
|
||||
'$': ' ',
|
||||
'%': ' ',
|
||||
'^': ' ',
|
||||
'&': ' ',
|
||||
'*': ' ',
|
||||
'(': ' ',
|
||||
')': ' ',
|
||||
'-': ' ',
|
||||
'+': ' ',
|
||||
'=': ' ',
|
||||
'~': ' ',
|
||||
'?': ' ',
|
||||
'|': ' ',
|
||||
'/': ' ',
|
||||
'\\': ' ',
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
def _sanitize(query: str) -> str:
    """Replace FalkorDB special characters with whitespace."""
    sanitized = query.translate(_SEPARATOR_MAP)
    return ' '.join(sanitized.split())
|
||||
|
||||
|
||||
def _build_falkor_fulltext_query(
    query: str,
    group_ids: list[str] | None = None,
    max_query_length: int = MAX_QUERY_LENGTH,
) -> str:
    """Build a fulltext query string for FalkorDB using RedisSearch syntax."""
    if group_ids is None or len(group_ids) == 0:
        group_filter = ''
    else:
        escaped_group_ids = [f'"{gid}"' for gid in group_ids]
        group_values = '|'.join(escaped_group_ids)
        group_filter = f'(@group_id:{group_values})'

    sanitized_query = _sanitize(query)

    # Remove stopwords and empty tokens
    query_words = sanitized_query.split()
    filtered_words = [word for word in query_words if word and word.lower() not in STOPWORDS]
    sanitized_query = ' | '.join(filtered_words)

    if len(sanitized_query.split(' ')) + len(group_ids or '') >= max_query_length:
        return ''

    full_query = group_filter + ' (' + sanitized_query + ')'
    return full_query
|
||||
|
||||
|
||||
class FalkorSearchOperations(SearchOperations):
    # --- Node search ---

    async def node_fulltext_search(
        self,
        executor: QueryExecutor,
        query: str,
        search_filter: SearchFilters,
        group_ids: list[str] | None = None,
        limit: int = 10,
    ) -> list[EntityNode]:
        fuzzy_query = _build_falkor_fulltext_query(query, group_ids)
        if fuzzy_query == '':
            return []

        filter_queries, filter_params = node_search_filter_query_constructor(
            search_filter, GraphProvider.FALKORDB
        )

        if group_ids is not None:
            filter_queries.append('n.group_id IN $group_ids')
            filter_params['group_ids'] = group_ids

        filter_query = ''
        if filter_queries:
            filter_query = ' WHERE ' + (' AND '.join(filter_queries))

        cypher = (
            get_nodes_query(
                'node_name_and_summary', '$query', limit=limit, provider=GraphProvider.FALKORDB
            )
            + 'YIELD node AS n, score'
            + filter_query
            + """
            WITH n, score
            ORDER BY score DESC
            LIMIT $limit
            RETURN
            """
            + get_entity_node_return_query(GraphProvider.FALKORDB)
        )

        records, _, _ = await executor.execute_query(
            cypher,
            query=fuzzy_query,
            limit=limit,
            **filter_params,
        )

        return [entity_node_from_record(r) for r in records]
|
||||
|
||||
    async def node_similarity_search(
        self,
        executor: QueryExecutor,
        search_vector: list[float],
        search_filter: SearchFilters,
        group_ids: list[str] | None = None,
        limit: int = 10,
        min_score: float = 0.6,
    ) -> list[EntityNode]:
        filter_queries, filter_params = node_search_filter_query_constructor(
            search_filter, GraphProvider.FALKORDB
        )

        if group_ids is not None:
            filter_queries.append('n.group_id IN $group_ids')
            filter_params['group_ids'] = group_ids

        filter_query = ''
        if filter_queries:
            filter_query = ' WHERE ' + (' AND '.join(filter_queries))

        # PATCHED 2026-05-02 (BirdAI vendored patch): use FalkorDB native vector
        # index when available; fall back to interpreted-Cypher cosine math
        # otherwise. The filter clause's position changes between paths
        # (after MATCH for fallback, after YIELD for index path), but the
        # filter expressions themselves are identical because they reference
        # the bound variable `n` either way.
        use_index = await _falkordb_vector_index_exists(
            executor, 'Entity', 'name_embedding', 'NODE'
        )
        prefix, _ = _falkordb_vector_node_search_cypher(
            'Entity', 'name_embedding', '$search_vector', use_index
        )
        where_clauses = []
        if filter_query:
            where_clauses.append(filter_query.replace(' WHERE ', '', 1).strip())
        where_clauses.append('score > $min_score')
        unified_where = ' WHERE ' + ' AND '.join(where_clauses)

        cypher = (
            prefix
            + unified_where
            + """
            RETURN
            """
            + get_entity_node_return_query(GraphProvider.FALKORDB)
            + """
            ORDER BY score DESC
            LIMIT $limit
            """
        )
        params = dict(
            search_vector=search_vector,
            limit=limit,
            min_score=min_score,
            **filter_params,
        )
        if use_index:
            params['candidate_k'] = limit * VECTOR_INDEX_CANDIDATE_MULTIPLIER
        records, _, _ = await executor.execute_query(cypher, **params)

        return [entity_node_from_record(r) for r in records]
|
||||
|
||||
    async def node_bfs_search(
        self,
        executor: QueryExecutor,
        origin_uuids: list[str],
        search_filter: SearchFilters,
        max_depth: int,
        group_ids: list[str] | None = None,
        limit: int = 10,
    ) -> list[EntityNode]:
        if not origin_uuids or max_depth < 1:
            return []

        filter_queries, filter_params = node_search_filter_query_constructor(
            search_filter, GraphProvider.FALKORDB
        )

        if group_ids is not None:
            filter_queries.append('n.group_id IN $group_ids')
            filter_queries.append('origin.group_id IN $group_ids')
            filter_params['group_ids'] = group_ids

        filter_query = ''
        if filter_queries:
            filter_query = ' AND ' + (' AND '.join(filter_queries))

        cypher = (
            f"""
            UNWIND $bfs_origin_node_uuids AS origin_uuid
            MATCH (origin {{uuid: origin_uuid}})-[:RELATES_TO|MENTIONS*1..{max_depth}]->(n:Entity)
            WHERE n.group_id = origin.group_id
            """
            + filter_query
            + """
            RETURN
            """
            + get_entity_node_return_query(GraphProvider.FALKORDB)
            + """
            LIMIT $limit
            """
        )

        records, _, _ = await executor.execute_query(
            cypher,
            bfs_origin_node_uuids=origin_uuids,
            limit=limit,
            **filter_params,
        )

        return [entity_node_from_record(r) for r in records]
|
||||
|
||||
    # --- Edge search ---

    async def edge_fulltext_search(
        self,
        executor: QueryExecutor,
        query: str,
        search_filter: SearchFilters,
        group_ids: list[str] | None = None,
        limit: int = 10,
    ) -> list[EntityEdge]:
        fuzzy_query = _build_falkor_fulltext_query(query, group_ids)
        if fuzzy_query == '':
            return []

        filter_queries, filter_params = edge_search_filter_query_constructor(
            search_filter, GraphProvider.FALKORDB
        )

        if group_ids is not None:
            filter_queries.append('e.group_id IN $group_ids')
            filter_params['group_ids'] = group_ids

        filter_query = ''
        if filter_queries:
            filter_query = ' WHERE ' + (' AND '.join(filter_queries))

        cypher = (
            get_relationships_query(
                'edge_name_and_fact', limit=limit, provider=GraphProvider.FALKORDB
            )
            + """
            YIELD relationship AS rel, score
            MATCH (n:Entity)-[e:RELATES_TO {uuid: rel.uuid}]->(m:Entity)
            """
            + filter_query
            + """
            WITH e, score, n, m
            RETURN
            """
            + get_entity_edge_return_query(GraphProvider.FALKORDB)
            + """
            ORDER BY score DESC
            LIMIT $limit
            """
        )

        records, _, _ = await executor.execute_query(
            cypher,
            query=fuzzy_query,
            limit=limit,
            **filter_params,
        )

        return [entity_edge_from_record(r) for r in records]
|
||||
|
||||
async def edge_similarity_search(
|
||||
self,
|
||||
executor: QueryExecutor,
|
||||
search_vector: list[float],
|
||||
source_node_uuid: str | None,
|
||||
target_node_uuid: str | None,
|
||||
search_filter: SearchFilters,
|
||||
group_ids: list[str] | None = None,
|
||||
limit: int = 10,
|
||||
min_score: float = 0.6,
|
||||
) -> list[EntityEdge]:
|
||||
filter_queries, filter_params = edge_search_filter_query_constructor(
|
||||
search_filter, GraphProvider.FALKORDB
|
||||
)
|
||||
|
||||
if group_ids is not None:
|
||||
filter_queries.append('e.group_id IN $group_ids')
|
||||
filter_params['group_ids'] = group_ids
|
||||
|
||||
if source_node_uuid is not None:
|
||||
filter_params['source_uuid'] = source_node_uuid
|
||||
filter_queries.append('n.uuid = $source_uuid')
|
||||
|
||||
if target_node_uuid is not None:
|
||||
filter_params['target_uuid'] = target_node_uuid
|
||||
filter_queries.append('m.uuid = $target_uuid')
|
||||
|
||||
filter_query = ''
|
||||
if filter_queries:
|
||||
filter_query = ' WHERE ' + (' AND '.join(filter_queries))
|
||||
|
||||
# PATCHED 2026-05-02 (BirdAI vendored patch): use FalkorDB native vector
|
||||
# index on RELATES_TO.fact_embedding when available. The unindexed
|
||||
# fallback is the same MATCH-and-cosine math that previously hung
|
||||
# for 6+ minutes on a 4,000-entity graph; this is the load-bearing
|
||||
# call site that motivated the patch.
|
||||
use_index = await _falkordb_vector_index_exists(
|
||||
executor, 'RELATES_TO', 'fact_embedding', 'RELATIONSHIP'
|
||||
)
|
||||
prefix, _ = _falkordb_vector_edge_search_cypher(
|
||||
'RELATES_TO', 'fact_embedding', '$search_vector', use_index
|
||||
)
|
||||
where_clauses = []
|
||||
if filter_query:
|
||||
where_clauses.append(filter_query.replace(' WHERE ', '', 1).strip())
|
||||
where_clauses.append('score > $min_score')
|
||||
unified_where = ' WHERE ' + ' AND '.join(where_clauses)
|
||||
|
||||
cypher = (
|
||||
prefix
|
||||
+ unified_where
|
||||
+ """
|
||||
RETURN
|
||||
"""
|
||||
+ get_entity_edge_return_query(GraphProvider.FALKORDB)
|
||||
+ """
|
||||
ORDER BY score DESC
|
||||
LIMIT $limit
|
||||
"""
|
||||
)
|
||||
params = dict(
|
||||
search_vector=search_vector,
|
||||
limit=limit,
|
||||
min_score=min_score,
|
||||
**filter_params,
|
||||
)
|
||||
if use_index:
|
||||
params['candidate_k'] = limit * VECTOR_INDEX_CANDIDATE_MULTIPLIER
|
||||
records, _, _ = await executor.execute_query(cypher, **params)
|
||||
|
||||
return [entity_edge_from_record(r) for r in records]
|
||||
|
||||
async def edge_bfs_search(
|
||||
self,
|
||||
executor: QueryExecutor,
|
||||
origin_uuids: list[str],
|
||||
max_depth: int,
|
||||
search_filter: SearchFilters,
|
||||
group_ids: list[str] | None = None,
|
||||
limit: int = 10,
|
||||
) -> list[EntityEdge]:
|
||||
if not origin_uuids:
|
||||
return []
|
||||
|
||||
filter_queries, filter_params = edge_search_filter_query_constructor(
|
||||
search_filter, GraphProvider.FALKORDB
|
||||
)
|
||||
|
||||
if group_ids is not None:
|
||||
filter_queries.append('e.group_id IN $group_ids')
|
||||
filter_params['group_ids'] = group_ids
|
||||
|
||||
filter_query = ''
|
||||
if filter_queries:
|
||||
filter_query = ' WHERE ' + (' AND '.join(filter_queries))
|
||||
|
||||
cypher = (
|
||||
f"""
|
||||
UNWIND $bfs_origin_node_uuids AS origin_uuid
|
||||
MATCH path = (origin {{uuid: origin_uuid}})-[:RELATES_TO|MENTIONS*1..{max_depth}]->(:Entity)
|
||||
UNWIND relationships(path) AS rel
|
||||
MATCH (n:Entity)-[e:RELATES_TO {{uuid: rel.uuid}}]-(m:Entity)
|
||||
"""
|
||||
+ filter_query
|
||||
+ """
|
||||
RETURN DISTINCT
|
||||
"""
|
||||
+ get_entity_edge_return_query(GraphProvider.FALKORDB)
|
||||
+ """
|
||||
LIMIT $limit
|
||||
"""
|
||||
)
|
||||
|
||||
records, _, _ = await executor.execute_query(
|
||||
cypher,
|
||||
bfs_origin_node_uuids=origin_uuids,
|
||||
depth=max_depth,
|
||||
limit=limit,
|
||||
**filter_params,
|
||||
)
|
||||
|
||||
return [entity_edge_from_record(r) for r in records]
|
||||
|
||||
# --- Episode search ---
|
||||
|
||||
async def episode_fulltext_search(
|
||||
self,
|
||||
executor: QueryExecutor,
|
||||
query: str,
|
||||
search_filter: SearchFilters, # noqa: ARG002
|
||||
group_ids: list[str] | None = None,
|
||||
limit: int = 10,
|
||||
) -> list[EpisodicNode]:
|
||||
fuzzy_query = _build_falkor_fulltext_query(query, group_ids)
|
||||
if fuzzy_query == '':
|
||||
return []
|
||||
|
||||
filter_params: dict[str, Any] = {}
|
||||
group_filter_query = ''
|
||||
if group_ids is not None:
|
||||
group_filter_query += '\nAND e.group_id IN $group_ids'
|
||||
filter_params['group_ids'] = group_ids
|
||||
|
||||
cypher = (
|
||||
get_nodes_query(
|
||||
'episode_content', '$query', limit=limit, provider=GraphProvider.FALKORDB
|
||||
)
|
||||
+ """
|
||||
YIELD node AS episode, score
|
||||
MATCH (e:Episodic)
|
||||
WHERE e.uuid = episode.uuid
|
||||
"""
|
||||
+ group_filter_query
|
||||
+ """
|
||||
RETURN
|
||||
"""
|
||||
+ EPISODIC_NODE_RETURN
|
||||
+ """
|
||||
ORDER BY score DESC
|
||||
LIMIT $limit
|
||||
"""
|
||||
)
|
||||
|
||||
records, _, _ = await executor.execute_query(
|
||||
cypher, query=fuzzy_query, limit=limit, **filter_params
|
||||
)
|
||||
|
||||
return [episodic_node_from_record(r) for r in records]
|
||||
|
||||
# --- Community search ---
|
||||
|
||||
async def community_fulltext_search(
|
||||
self,
|
||||
executor: QueryExecutor,
|
||||
query: str,
|
||||
group_ids: list[str] | None = None,
|
||||
limit: int = 10,
|
||||
) -> list[CommunityNode]:
|
||||
fuzzy_query = _build_falkor_fulltext_query(query, group_ids)
|
||||
if fuzzy_query == '':
|
||||
return []
|
||||
|
||||
filter_params: dict[str, Any] = {}
|
||||
group_filter_query = ''
|
||||
if group_ids is not None:
|
||||
group_filter_query = 'WHERE c.group_id IN $group_ids'
|
||||
filter_params['group_ids'] = group_ids
|
||||
|
||||
cypher = (
|
||||
get_nodes_query(
|
||||
'community_name', '$query', limit=limit, provider=GraphProvider.FALKORDB
|
||||
)
|
||||
+ """
|
||||
YIELD node AS c, score
|
||||
WITH c, score
|
||||
"""
|
||||
+ group_filter_query
|
||||
+ """
|
||||
RETURN
|
||||
"""
|
||||
+ COMMUNITY_NODE_RETURN
|
||||
+ """
|
||||
ORDER BY score DESC
|
||||
LIMIT $limit
|
||||
"""
|
||||
)
|
||||
|
||||
records, _, _ = await executor.execute_query(
|
||||
cypher, query=fuzzy_query, limit=limit, **filter_params
|
||||
)
|
||||
|
||||
return [community_node_from_record(r) for r in records]
|
||||
|
||||
async def community_similarity_search(
|
||||
self,
|
||||
executor: QueryExecutor,
|
||||
search_vector: list[float],
|
||||
group_ids: list[str] | None = None,
|
||||
limit: int = 10,
|
||||
min_score: float = 0.6,
|
||||
) -> list[CommunityNode]:
|
||||
query_params: dict[str, Any] = {}
|
||||
|
||||
group_filter_query = ''
|
||||
if group_ids is not None:
|
||||
group_filter_query += ' WHERE c.group_id IN $group_ids'
|
||||
query_params['group_ids'] = group_ids
|
||||
|
||||
# PATCHED 2026-05-02 (BirdAI vendored patch): use FalkorDB native vector
|
||||
# index on Community.name_embedding when available. Note: the existing
|
||||
# filter is built into `group_filter_query` (already prefixed with
|
||||
# ' WHERE ' if non-empty) and uses variable `c`. The dispatcher binds
|
||||
# the node as `n` for parity with the helper signature, then we
|
||||
# re-bind to `c` via WITH so the rest of the query is unchanged.
|
||||
use_index = await _falkordb_vector_index_exists(
|
||||
executor, 'Community', 'name_embedding', 'NODE'
|
||||
)
|
||||
prefix, _ = _falkordb_vector_node_search_cypher(
|
||||
'Community', 'name_embedding', '$search_vector', use_index
|
||||
)
|
||||
prefix = prefix + ' WITH n AS c, score '
|
||||
where_clauses = []
|
||||
if group_filter_query:
|
||||
where_clauses.append(group_filter_query.replace(' WHERE ', '', 1).strip())
|
||||
where_clauses.append('score > $min_score')
|
||||
unified_where = ' WHERE ' + ' AND '.join(where_clauses)
|
||||
|
||||
cypher = (
|
||||
prefix
|
||||
+ unified_where
|
||||
+ """
|
||||
RETURN
|
||||
"""
|
||||
+ COMMUNITY_NODE_RETURN
|
||||
+ """
|
||||
ORDER BY score DESC
|
||||
LIMIT $limit
|
||||
"""
|
||||
)
|
||||
params = dict(
|
||||
search_vector=search_vector,
|
||||
limit=limit,
|
||||
min_score=min_score,
|
||||
**query_params,
|
||||
)
|
||||
if use_index:
|
||||
params['candidate_k'] = limit * VECTOR_INDEX_CANDIDATE_MULTIPLIER
|
||||
records, _, _ = await executor.execute_query(cypher, **params)
|
||||
|
||||
return [community_node_from_record(r) for r in records]
|
||||
|
||||
# --- Rerankers ---
|
||||
|
||||
async def node_distance_reranker(
|
||||
self,
|
||||
executor: QueryExecutor,
|
||||
node_uuids: list[str],
|
||||
center_node_uuid: str,
|
||||
min_score: float = 0,
|
||||
) -> list[EntityNode]:
|
||||
filtered_uuids = [u for u in node_uuids if u != center_node_uuid]
|
||||
scores: dict[str, float] = {center_node_uuid: 0.0}
|
||||
|
||||
cypher = """
|
||||
UNWIND $node_uuids AS node_uuid
|
||||
MATCH (center:Entity {uuid: $center_uuid})-[:RELATES_TO]-(n:Entity {uuid: node_uuid})
|
||||
RETURN 1 AS score, node_uuid AS uuid
|
||||
"""
|
||||
|
||||
results, _, _ = await executor.execute_query(
|
||||
cypher,
|
||||
node_uuids=filtered_uuids,
|
||||
center_uuid=center_node_uuid,
|
||||
)
|
||||
|
||||
for result in results:
|
||||
scores[result['uuid']] = result['score']
|
||||
|
||||
for uuid in filtered_uuids:
|
||||
if uuid not in scores:
|
||||
scores[uuid] = float('inf')
|
||||
|
||||
filtered_uuids.sort(key=lambda cur_uuid: scores[cur_uuid])
|
||||
|
||||
if center_node_uuid in node_uuids:
|
||||
scores[center_node_uuid] = 0.1
|
||||
filtered_uuids = [center_node_uuid] + filtered_uuids
|
||||
|
||||
reranked_uuids = [u for u in filtered_uuids if (1 / scores[u]) >= min_score]
|
||||
|
||||
if not reranked_uuids:
|
||||
return []
|
||||
|
||||
get_query = """
|
||||
MATCH (n:Entity)
|
||||
WHERE n.uuid IN $uuids
|
||||
RETURN
|
||||
""" + get_entity_node_return_query(GraphProvider.FALKORDB)
|
||||
|
||||
records, _, _ = await executor.execute_query(get_query, uuids=reranked_uuids)
|
||||
|
||||
node_map = {r['uuid']: entity_node_from_record(r) for r in records}
|
||||
return [node_map[u] for u in reranked_uuids if u in node_map]
|
||||
|
||||
async def episode_mentions_reranker(
|
||||
self,
|
||||
executor: QueryExecutor,
|
||||
node_uuids: list[str],
|
||||
min_score: float = 0,
|
||||
) -> list[EntityNode]:
|
||||
if not node_uuids:
|
||||
return []
|
||||
|
||||
scores: dict[str, float] = {}
|
||||
|
||||
results, _, _ = await executor.execute_query(
|
||||
"""
|
||||
UNWIND $node_uuids AS node_uuid
|
||||
MATCH (episode:Episodic)-[r:MENTIONS]->(n:Entity {uuid: node_uuid})
|
||||
RETURN count(*) AS score, n.uuid AS uuid
|
||||
""",
|
||||
node_uuids=node_uuids,
|
||||
)
|
||||
|
||||
for result in results:
|
||||
scores[result['uuid']] = result['score']
|
||||
|
||||
for uuid in node_uuids:
|
||||
if uuid not in scores:
|
||||
scores[uuid] = float('inf')
|
||||
|
||||
sorted_uuids = list(node_uuids)
|
||||
sorted_uuids.sort(key=lambda cur_uuid: scores[cur_uuid])
|
||||
|
||||
reranked_uuids = [u for u in sorted_uuids if scores[u] >= min_score]
|
||||
|
||||
if not reranked_uuids:
|
||||
return []
|
||||
|
||||
get_query = """
|
||||
MATCH (n:Entity)
|
||||
WHERE n.uuid IN $uuids
|
||||
RETURN
|
||||
""" + get_entity_node_return_query(GraphProvider.FALKORDB)
|
||||
|
||||
records, _, _ = await executor.execute_query(get_query, uuids=reranked_uuids)
|
||||
|
||||
node_map = {r['uuid']: entity_node_from_record(r) for r in records}
|
||||
return [node_map[u] for u in reranked_uuids if u in node_map]
|
||||
|
||||
# --- Filter builders ---
|
||||
|
||||
def build_node_search_filters(self, search_filters: SearchFilters) -> Any:
|
||||
filter_queries, filter_params = node_search_filter_query_constructor(
|
||||
search_filters, GraphProvider.FALKORDB
|
||||
)
|
||||
return {'filter_queries': filter_queries, 'filter_params': filter_params}
|
||||
|
||||
def build_edge_search_filters(self, search_filters: SearchFilters) -> Any:
|
||||
filter_queries, filter_params = edge_search_filter_query_constructor(
|
||||
search_filters, GraphProvider.FALKORDB
|
||||
)
|
||||
return {'filter_queries': filter_queries, 'filter_params': filter_params}
|
||||
|
||||
# --- Fulltext query builder ---
|
||||
|
||||
def build_fulltext_query(
|
||||
self,
|
||||
query: str,
|
||||
group_ids: list[str] | None = None,
|
||||
max_query_length: int = MAX_QUERY_LENGTH,
|
||||
) -> str:
|
||||
return _build_falkor_fulltext_query(query, group_ids, max_query_length)
|
||||
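Review note: the patched similarity-search methods above all follow the same two steps. They merge any optional filter clauses with the mandatory `score > $min_score` predicate into one `WHERE`, and they add a `candidate_k` parameter only when the native vector index is in use. The sketch below isolates that pattern so it can be checked without a database. The helper names `build_unified_where` and `build_params` are hypothetical and do not exist in the patch; the multiplier value of 5 is copied from `graph_queries.py`.

```python
from typing import Any

# Mirrors the constant defined in graph_queries.py (value taken from the patch).
VECTOR_INDEX_CANDIDATE_MULTIPLIER = 5


def build_unified_where(filter_queries: list[str]) -> str:
    """Join optional filter clauses with the mandatory score threshold.

    Hypothetical helper: the patch builds this string inline in each method.
    """
    clauses = [*filter_queries, 'score > $min_score']
    return ' WHERE ' + ' AND '.join(clauses)


def build_params(
    search_vector: list[float],
    limit: int,
    min_score: float,
    use_index: bool,
    **filter_params: Any,
) -> dict[str, Any]:
    """Assemble query parameters, over-fetching only on the indexed path."""
    params: dict[str, Any] = dict(
        search_vector=search_vector,
        limit=limit,
        min_score=min_score,
        **filter_params,
    )
    if use_index:
        # Over-fetch so that rows rejected by the WHERE filters do not
        # starve the final LIMIT.
        params['candidate_k'] = limit * VECTOR_INDEX_CANDIDATE_MULTIPLIER
    return params
```

With `limit=10` the indexed path requests 50 candidates; the unindexed fallback leaves `candidate_k` out entirely, which matches the `if use_index:` guard in the methods above.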
@@ -0,0 +1,444 @@
|
||||
"""
|
||||
Copyright 2024, Zep Software, Inc.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License");
|
||||
you may not use this file except in compliance with the License.
|
||||
You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software
|
||||
distributed under the License is distributed on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
See the License for the specific language governing permissions and
|
||||
limitations under the License.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import datetime
|
||||
import logging
|
||||
from typing import TYPE_CHECKING, Any
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from falkordb import Graph as FalkorGraph
|
||||
from falkordb.asyncio import FalkorDB
|
||||
else:
|
||||
try:
|
||||
from falkordb import Graph as FalkorGraph
|
||||
from falkordb.asyncio import FalkorDB
|
||||
except ImportError:
|
||||
# If falkordb is not installed, raise an ImportError
|
||||
raise ImportError(
|
||||
'falkordb is required for FalkorDriver. '
|
||||
'Install it with: pip install graphiti-core[falkordb]'
|
||||
) from None
|
||||
|
||||
from graphiti_core.driver.driver import GraphDriver, GraphDriverSession, GraphProvider
|
||||
from graphiti_core.driver.falkordb import STOPWORDS as STOPWORDS
|
||||
from graphiti_core.driver.falkordb.operations.community_edge_ops import (
|
||||
FalkorCommunityEdgeOperations,
|
||||
)
|
||||
from graphiti_core.driver.falkordb.operations.community_node_ops import (
|
||||
FalkorCommunityNodeOperations,
|
||||
)
|
||||
from graphiti_core.driver.falkordb.operations.entity_edge_ops import FalkorEntityEdgeOperations
|
||||
from graphiti_core.driver.falkordb.operations.entity_node_ops import FalkorEntityNodeOperations
|
||||
from graphiti_core.driver.falkordb.operations.episode_node_ops import FalkorEpisodeNodeOperations
|
||||
from graphiti_core.driver.falkordb.operations.episodic_edge_ops import FalkorEpisodicEdgeOperations
|
||||
from graphiti_core.driver.falkordb.operations.graph_ops import FalkorGraphMaintenanceOperations
|
||||
from graphiti_core.driver.falkordb.operations.has_episode_edge_ops import (
|
||||
FalkorHasEpisodeEdgeOperations,
|
||||
)
|
||||
from graphiti_core.driver.falkordb.operations.next_episode_edge_ops import (
|
||||
FalkorNextEpisodeEdgeOperations,
|
||||
)
|
||||
from graphiti_core.driver.falkordb.operations.saga_node_ops import FalkorSagaNodeOperations
|
||||
from graphiti_core.driver.falkordb.operations.search_ops import FalkorSearchOperations
|
||||
from graphiti_core.driver.operations.community_edge_ops import CommunityEdgeOperations
|
||||
from graphiti_core.driver.operations.community_node_ops import CommunityNodeOperations
|
||||
from graphiti_core.driver.operations.entity_edge_ops import EntityEdgeOperations
|
||||
from graphiti_core.driver.operations.entity_node_ops import EntityNodeOperations
|
||||
from graphiti_core.driver.operations.episode_node_ops import EpisodeNodeOperations
|
||||
from graphiti_core.driver.operations.episodic_edge_ops import EpisodicEdgeOperations
|
||||
from graphiti_core.driver.operations.graph_ops import GraphMaintenanceOperations
|
||||
from graphiti_core.driver.operations.has_episode_edge_ops import HasEpisodeEdgeOperations
|
||||
from graphiti_core.driver.operations.next_episode_edge_ops import NextEpisodeEdgeOperations
|
||||
from graphiti_core.driver.operations.saga_node_ops import SagaNodeOperations
|
||||
from graphiti_core.driver.operations.search_ops import SearchOperations
|
||||
from graphiti_core.graph_queries import get_fulltext_indices, get_range_indices, get_vector_indices
|
||||
from graphiti_core.helpers import validate_group_ids
|
||||
from graphiti_core.utils.datetime_utils import convert_datetimes_to_strings
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class FalkorDriverSession(GraphDriverSession):
|
||||
provider = GraphProvider.FALKORDB
|
||||
|
||||
def __init__(self, graph: FalkorGraph):
|
||||
self.graph = graph
|
||||
|
||||
async def __aenter__(self):
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc, tb):
|
||||
# No cleanup needed for Falkor, but method must exist
|
||||
pass
|
||||
|
||||
async def close(self):
|
||||
# No explicit close needed for FalkorDB, but method must exist
|
||||
pass
|
||||
|
||||
async def execute_write(self, func, *args, **kwargs):
|
||||
# Directly await the provided async function with `self` as the transaction/session
|
||||
return await func(self, *args, **kwargs)
|
||||
|
||||
async def run(self, query: str | list, **kwargs: Any) -> Any:
|
||||
# FalkorDB does not accept a label set as a query parameter, so such writes arrive as a list of (query, params) pairs
|
||||
if isinstance(query, list):
|
||||
for cypher, params in query:
|
||||
params = convert_datetimes_to_strings(params)
|
||||
await self.graph.query(str(cypher), params) # type: ignore[reportUnknownArgumentType]
|
||||
else:
|
||||
params = dict(kwargs)
|
||||
params = convert_datetimes_to_strings(params)
|
||||
await self.graph.query(str(query), params) # type: ignore[reportUnknownArgumentType]
|
||||
# Assuming `graph.query` is async (ideal); otherwise, wrap in executor
|
||||
return None
|
||||
|
||||
|
||||
class FalkorDriver(GraphDriver):
|
||||
provider = GraphProvider.FALKORDB
|
||||
default_group_id: str = '\\_'
|
||||
fulltext_syntax: str = '@' # FalkorDB uses a redisearch-like syntax for fulltext queries
|
||||
aoss_client: None = None
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
host: str = 'localhost',
|
||||
port: int = 6379,
|
||||
username: str | None = None,
|
||||
password: str | None = None,
|
||||
falkor_db: FalkorDB | None = None,
|
||||
database: str = 'default_db',
|
||||
):
|
||||
"""
|
||||
Initialize the FalkorDB driver.
|
||||
|
||||
FalkorDB is a multi-tenant graph database.
|
||||
To connect, provide the host and port.
|
||||
The default parameters assume a local (on-premises) FalkorDB instance.
|
||||
|
||||
Args:
|
||||
host (str): The host where FalkorDB is running.
|
||||
port (int): The port on which FalkorDB is listening.
|
||||
username (str | None): The username for authentication (if required).
|
||||
password (str | None): The password for authentication (if required).
|
||||
falkor_db (FalkorDB | None): An existing FalkorDB instance to use instead of creating a new one.
|
||||
database (str): The name of the database to connect to. Defaults to 'default_db'.
|
||||
"""
|
||||
super().__init__()
|
||||
self._database = database
|
||||
if falkor_db is not None:
|
||||
# If a FalkorDB instance is provided, use it directly
|
||||
self.client = falkor_db
|
||||
else:
|
||||
self.client = FalkorDB(host=host, port=port, username=username, password=password)
|
||||
|
||||
# Instantiate FalkorDB operations
|
||||
self._entity_node_ops = FalkorEntityNodeOperations()
|
||||
self._episode_node_ops = FalkorEpisodeNodeOperations()
|
||||
self._community_node_ops = FalkorCommunityNodeOperations()
|
||||
self._saga_node_ops = FalkorSagaNodeOperations()
|
||||
self._entity_edge_ops = FalkorEntityEdgeOperations()
|
||||
self._episodic_edge_ops = FalkorEpisodicEdgeOperations()
|
||||
self._community_edge_ops = FalkorCommunityEdgeOperations()
|
||||
self._has_episode_edge_ops = FalkorHasEpisodeEdgeOperations()
|
||||
self._next_episode_edge_ops = FalkorNextEpisodeEdgeOperations()
|
||||
self._search_ops = FalkorSearchOperations()
|
||||
self._graph_ops = FalkorGraphMaintenanceOperations()
|
||||
|
||||
# Schedule the indices and constraints to be built
|
||||
try:
|
||||
# Try to get the current event loop
|
||||
loop = asyncio.get_running_loop()
|
||||
# Schedule the build_indices_and_constraints to run
|
||||
loop.create_task(self.build_indices_and_constraints())
|
||||
except RuntimeError:
|
||||
# No event loop running, this will be handled later
|
||||
pass
|
||||
|
||||
# --- Operations properties ---
|
||||
|
||||
@property
|
||||
def entity_node_ops(self) -> EntityNodeOperations:
|
||||
return self._entity_node_ops
|
||||
|
||||
@property
|
||||
def episode_node_ops(self) -> EpisodeNodeOperations:
|
||||
return self._episode_node_ops
|
||||
|
||||
@property
|
||||
def community_node_ops(self) -> CommunityNodeOperations:
|
||||
return self._community_node_ops
|
||||
|
||||
@property
|
||||
def saga_node_ops(self) -> SagaNodeOperations:
|
||||
return self._saga_node_ops
|
||||
|
||||
@property
|
||||
def entity_edge_ops(self) -> EntityEdgeOperations:
|
||||
return self._entity_edge_ops
|
||||
|
||||
@property
|
||||
def episodic_edge_ops(self) -> EpisodicEdgeOperations:
|
||||
return self._episodic_edge_ops
|
||||
|
||||
@property
|
||||
def community_edge_ops(self) -> CommunityEdgeOperations:
|
||||
return self._community_edge_ops
|
||||
|
||||
@property
|
||||
def has_episode_edge_ops(self) -> HasEpisodeEdgeOperations:
|
||||
return self._has_episode_edge_ops
|
||||
|
||||
@property
|
||||
def next_episode_edge_ops(self) -> NextEpisodeEdgeOperations:
|
||||
return self._next_episode_edge_ops
|
||||
|
||||
@property
|
||||
def search_ops(self) -> SearchOperations:
|
||||
return self._search_ops
|
||||
|
||||
@property
|
||||
def graph_ops(self) -> GraphMaintenanceOperations:
|
||||
return self._graph_ops
|
||||
|
||||
def _get_graph(self, graph_name: str | None) -> FalkorGraph:
|
||||
# FalkorDB requires a non-None database name for multi-tenant graphs; the default is "default_db"
|
||||
if graph_name is None:
|
||||
graph_name = self._database
|
||||
return self.client.select_graph(graph_name)
|
||||
|
||||
async def execute_query(self, cypher_query_, **kwargs: Any):
|
||||
graph = self._get_graph(self._database)
|
||||
|
||||
# Convert datetime objects to ISO strings (FalkorDB does not support datetime objects directly)
|
||||
params = convert_datetimes_to_strings(dict(kwargs))
|
||||
|
||||
try:
|
||||
result = await graph.query(cypher_query_, params) # type: ignore[reportUnknownArgumentType]
|
||||
except Exception as e:
|
||||
if 'already indexed' in str(e):
|
||||
# check if index already exists
|
||||
logger.info(f'Index already exists: {e}')
|
||||
return None
|
||||
logger.error(f'Error executing FalkorDB query: {e}\n{cypher_query_}\n{params}')
|
||||
raise
|
||||
|
||||
# Convert the result header to a list of strings
|
||||
header = [h[1] for h in result.header]
|
||||
|
||||
# Convert FalkorDB's result format (list of lists) to the format expected by Graphiti (list of dicts)
|
||||
records = []
|
||||
for row in result.result_set:
|
||||
record = {}
|
||||
for i, field_name in enumerate(header):
|
||||
if i < len(row):
|
||||
record[field_name] = row[i]
|
||||
else:
|
||||
# If there are more fields in header than values in row, set to None
|
||||
record[field_name] = None
|
||||
records.append(record)
|
||||
|
||||
return records, header, None
|
||||
|
||||
def session(self, database: str | None = None) -> GraphDriverSession:
|
||||
return FalkorDriverSession(self._get_graph(database))
|
||||
|
||||
async def close(self) -> None:
|
||||
"""Close the driver connection."""
|
||||
if hasattr(self.client, 'aclose'):
|
||||
await self.client.aclose() # type: ignore[reportUnknownMemberType]
|
||||
elif hasattr(self.client.connection, 'aclose'):
|
||||
await self.client.connection.aclose()
|
||||
elif hasattr(self.client.connection, 'close'):
|
||||
await self.client.connection.close()
|
||||
|
||||
async def delete_all_indexes(self) -> None:
|
||||
result = await self.execute_query('CALL db.indexes()')
|
||||
if not result:
|
||||
return
|
||||
|
||||
records, _, _ = result
|
||||
drop_tasks = []
|
||||
|
||||
for record in records:
|
||||
label = record['label']
|
||||
entity_type = record['entitytype']
|
||||
|
||||
for field_name, index_type in record['types'].items():
|
||||
if 'RANGE' in index_type:
|
||||
drop_tasks.append(self.execute_query(f'DROP INDEX ON :{label}({field_name})'))
|
||||
elif 'FULLTEXT' in index_type:
|
||||
if entity_type == 'NODE':
|
||||
drop_tasks.append(
|
||||
self.execute_query(
|
||||
f'DROP FULLTEXT INDEX FOR (n:{label}) ON (n.{field_name})'
|
||||
)
|
||||
)
|
||||
elif entity_type == 'RELATIONSHIP':
|
||||
drop_tasks.append(
|
||||
self.execute_query(
|
||||
f'DROP FULLTEXT INDEX FOR ()-[e:{label}]-() ON (e.{field_name})'
|
||||
)
|
||||
)
|
||||
|
||||
if drop_tasks:
|
||||
await asyncio.gather(*drop_tasks)
|
||||
|
||||
async def build_indices_and_constraints(self, delete_existing=False):
|
||||
if delete_existing:
|
||||
await self.delete_all_indexes()
|
||||
# PATCHED 2026-05-02 (BirdAI vendored patch): add vector indexes alongside
|
||||
# range and fulltext. FalkorDB supports native vector indexes via
|
||||
# db.idx.vector.queryNodes / queryRelationships; without these, similarity
|
||||
# search runs as full-table-scan cosine math in interpreted Cypher.
|
||||
index_queries = (
|
||||
get_range_indices(self.provider)
|
||||
+ get_fulltext_indices(self.provider)
|
||||
+ get_vector_indices(self.provider)
|
||||
)
|
||||
for query in index_queries:
|
||||
await self.execute_query(query)
|
||||
# Invalidate the search_ops vector-index existence cache so subsequent
|
||||
# similarity queries re-probe and discover the indexes we just built.
|
||||
try:
|
||||
from graphiti_core.driver.falkordb.operations.search_ops import (
|
||||
_invalidate_falkordb_vector_index_cache,
|
||||
)
|
||||
_invalidate_falkordb_vector_index_cache()
|
||||
except ImportError:
|
||||
# The invalidation helper is unavailable (for example, an unpatched
# search_ops module); there is no cache to invalidate in that case.
|
||||
pass
|
||||
|
||||
def clone(self, database: str) -> 'GraphDriver':
|
||||
"""
|
||||
Returns a shallow copy of this driver with a different default database.
|
||||
Reuses the same connection (e.g. FalkorDB, Neo4j).
|
||||
"""
|
||||
if database == self._database:
|
||||
cloned = self
|
||||
elif database == self.default_group_id:
|
||||
cloned = FalkorDriver(falkor_db=self.client)
|
||||
else:
|
||||
# Create a new instance of FalkorDriver with the same connection but a different database
|
||||
cloned = FalkorDriver(falkor_db=self.client, database=database)
|
||||
|
||||
return cloned
|
||||
|
||||
async def health_check(self) -> None:
|
||||
"""Check FalkorDB connectivity by running a simple query."""
|
||||
try:
|
||||
await self.execute_query('MATCH (n) RETURN 1 LIMIT 1')
|
||||
return None
|
||||
except Exception as e:
|
||||
logger.error(f'FalkorDB health check failed: {e}')
|
||||
raise
|
||||
|
||||
@staticmethod
|
||||
def convert_datetimes_to_strings(obj):
|
||||
if isinstance(obj, dict):
|
||||
return {k: FalkorDriver.convert_datetimes_to_strings(v) for k, v in obj.items()}
|
||||
elif isinstance(obj, list):
|
||||
return [FalkorDriver.convert_datetimes_to_strings(item) for item in obj]
|
||||
elif isinstance(obj, tuple):
|
||||
return tuple(FalkorDriver.convert_datetimes_to_strings(item) for item in obj)
|
||||
elif isinstance(obj, datetime.datetime):
|
||||
return obj.isoformat()
|
||||
else:
|
||||
return obj
|
||||
|
||||
def sanitize(self, query: str) -> str:
|
||||
"""
|
||||
Replace FalkorDB special characters with whitespace.
|
||||
Based on FalkorDB tokenization rules: ,.<>{}[]"':;!@#$%^&*()-+=~
|
||||
"""
|
||||
# FalkorDB separator characters that break text into tokens
|
||||
separator_map = str.maketrans(
|
||||
{
|
||||
',': ' ',
|
||||
'.': ' ',
|
||||
'<': ' ',
|
||||
'>': ' ',
|
||||
'{': ' ',
|
||||
'}': ' ',
|
||||
'[': ' ',
|
||||
']': ' ',
|
||||
'"': ' ',
|
||||
"'": ' ',
|
||||
':': ' ',
|
||||
';': ' ',
|
||||
'!': ' ',
|
||||
'@': ' ',
|
||||
'#': ' ',
|
||||
'$': ' ',
|
||||
'%': ' ',
|
||||
'^': ' ',
|
||||
'&': ' ',
|
||||
'*': ' ',
|
||||
'(': ' ',
|
||||
')': ' ',
|
||||
'-': ' ',
|
||||
'+': ' ',
|
||||
'=': ' ',
|
||||
'~': ' ',
|
||||
'?': ' ',
|
||||
'|': ' ',
|
||||
'/': ' ',
|
||||
'\\': ' ',
|
||||
}
|
||||
)
|
||||
sanitized = query.translate(separator_map)
|
||||
# Clean up multiple spaces
|
||||
sanitized = ' '.join(sanitized.split())
|
||||
return sanitized
|
||||
|
||||
def build_fulltext_query(
|
||||
self, query: str, group_ids: list[str] | None = None, max_query_length: int = 128
|
||||
) -> str:
|
||||
"""
|
||||
Build a fulltext query string for FalkorDB using RediSearch syntax.
|
||||
FalkorDB uses RediSearch-like syntax where:
|
||||
- Field queries use @ prefix: @field:value
|
||||
- Multiple values for same field: (@field:value1|value2)
|
||||
- Text search doesn't need @ prefix for content fields
|
||||
- AND is implicit with space: (@group_id:value) (text)
|
||||
- OR uses pipe within parentheses: (@group_id:value1|value2)
|
||||
"""
|
||||
validate_group_ids(group_ids)
|
||||
|
||||
if group_ids is None or len(group_ids) == 0:
|
||||
group_filter = ''
|
||||
else:
|
||||
# Escape group_ids with quotes to prevent RediSearch syntax errors
|
||||
# with reserved words like "main" or special characters like hyphens
|
||||
escaped_group_ids = [f'"{gid}"' for gid in group_ids]
|
||||
group_values = '|'.join(escaped_group_ids)
|
||||
group_filter = f'(@group_id:{group_values})'
|
||||
|
||||
sanitized_query = self.sanitize(query)
|
||||
|
||||
# Remove stopwords and empty tokens from the sanitized query
|
||||
query_words = sanitized_query.split()
|
||||
filtered_words = [word for word in query_words if word and word.lower() not in STOPWORDS]
|
||||
sanitized_query = ' | '.join(filtered_words)
|
||||
|
||||
# If the query is too long, return an empty query
|
||||
if len(sanitized_query.split(' ')) + len(group_ids or '') >= max_query_length:
|
||||
return ''
|
||||
|
||||
full_query = group_filter + ' (' + sanitized_query + ')'
|
||||
|
||||
return full_query
|
||||
@@ -0,0 +1,242 @@
|
||||
"""
|
||||
Database query utilities for different graph database backends.
|
||||
|
||||
This module provides database-agnostic query generation for Neo4j and FalkorDB,
|
||||
supporting index creation, fulltext search, and bulk operations.
|
||||
|
||||
PATCHED for FalkorDB native vector index support (BirdAI vendored patch,
|
||||
2026-05-02). Adds:
|
||||
- get_vector_indices(): CREATE VECTOR INDEX statements for FalkorDB
|
||||
- get_vector_search_query(): Cypher fragment for vector similarity using
|
||||
FalkorDB's db.idx.vector procedures, with fallback to cosine math when
|
||||
the index does not yet exist
|
||||
- VECTOR_INDEX_CANDIDATE_MULTIPLIER: over-fetch factor for vector index
|
||||
queries to handle filter rejections after index lookup
|
||||
|
||||
No changes to Neo4j or Kuzu code paths.
|
||||
"""
|
||||
|
||||
from typing_extensions import LiteralString
|
||||
|
||||
from graphiti_core.driver.driver import GraphProvider
|
||||
|
||||
# Mapping from Neo4j fulltext index names to FalkorDB node labels
|
||||
NEO4J_TO_FALKORDB_MAPPING = {
|
||||
'node_name_and_summary': 'Entity',
|
||||
'community_name': 'Community',
|
||||
'episode_content': 'Episodic',
|
||||
'edge_name_and_fact': 'RELATES_TO',
|
||||
}
|
||||
# Mapping from fulltext index names to Kuzu node labels
|
||||
INDEX_TO_LABEL_KUZU_MAPPING = {
|
||||
'node_name_and_summary': 'Entity',
|
||||
'community_name': 'Community',
|
||||
'episode_content': 'Episodic',
|
||||
'edge_name_and_fact': 'RelatesToNode_',
|
||||
}
|
||||
|
||||
# Vector index over-fetch multiplier. When a vector index search is
|
||||
# combined with WHERE filters (group_id, source_uuid, etc.), some of
|
||||
# the top-k index results may be filtered out. Over-fetching by this
|
||||
# factor preserves recall against the final LIMIT after filtering.
|
||||
# Conservative default; tunable per-deployment by editing this constant
|
||||
# or via environment-variable override at the driver level (future).
|
||||
VECTOR_INDEX_CANDIDATE_MULTIPLIER = 5
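The over-fetch idea in the comment above can be sketched in plain Python. The function name `search_with_overfetch` and the dict-shaped hits are hypothetical, and the index lookup is simulated with a nearest-first list. Only the multiplier value comes from the module.

```python
VECTOR_INDEX_CANDIDATE_MULTIPLIER = 5  # value taken from the module above


def search_with_overfetch(index_hits, allowed_group_ids, limit):
    """index_hits: nearest-first list of dicts, standing in for a vector index result."""
    # Ask the index for more candidates than the caller wants...
    k = limit * VECTOR_INDEX_CANDIDATE_MULTIPLIER
    candidates = index_hits[:k]
    # ...then apply the WHERE-style filter and cut back to the requested limit.
    kept = [h for h in candidates if h["group_id"] in allowed_group_ids]
    return kept[:limit]


# Simulated index output: only every third hit belongs to the wanted group.
hits = [{"uuid": f"n{i}", "group_id": "aaron" if i % 3 == 0 else "other"} for i in range(100)]
top = search_with_overfetch(hits, {"aaron"}, limit=4)
print([h["uuid"] for h in top])
```

In this simulated data, fetching exactly `limit` candidates would leave only two survivors after filtering. Fetching `limit * 5` leaves enough to fill all four slots.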
|
||||
|
||||
|
||||
def get_range_indices(provider: GraphProvider) -> list[LiteralString]:
|
||||
if provider == GraphProvider.FALKORDB:
|
||||
return [
|
||||
# Entity node
|
||||
'CREATE INDEX FOR (n:Entity) ON (n.uuid, n.group_id, n.name, n.created_at)',
|
||||
# Episodic node
|
||||
'CREATE INDEX FOR (n:Episodic) ON (n.uuid, n.group_id, n.created_at, n.valid_at)',
|
||||
# Community node
|
||||
'CREATE INDEX FOR (n:Community) ON (n.uuid)',
|
||||
# Saga node
|
||||
'CREATE INDEX FOR (n:Saga) ON (n.uuid, n.group_id, n.name)',
|
||||
# RELATES_TO edge
|
||||
'CREATE INDEX FOR ()-[e:RELATES_TO]-() ON (e.uuid, e.group_id, e.name, e.created_at, e.expired_at, e.valid_at, e.invalid_at)',
|
||||
# MENTIONS edge
|
||||
'CREATE INDEX FOR ()-[e:MENTIONS]-() ON (e.uuid, e.group_id)',
|
||||
# HAS_MEMBER edge
|
||||
'CREATE INDEX FOR ()-[e:HAS_MEMBER]-() ON (e.uuid)',
|
||||
# HAS_EPISODE edge
|
||||
'CREATE INDEX FOR ()-[e:HAS_EPISODE]-() ON (e.uuid, e.group_id)',
|
||||
# NEXT_EPISODE edge
|
||||
'CREATE INDEX FOR ()-[e:NEXT_EPISODE]-() ON (e.uuid, e.group_id)',
|
||||
]
|
||||
|
||||
if provider == GraphProvider.KUZU:
|
||||
return []
|
||||
|
||||
return [
|
||||
'CREATE INDEX entity_uuid IF NOT EXISTS FOR (n:Entity) ON (n.uuid)',
|
||||
'CREATE INDEX episode_uuid IF NOT EXISTS FOR (n:Episodic) ON (n.uuid)',
|
||||
'CREATE INDEX community_uuid IF NOT EXISTS FOR (n:Community) ON (n.uuid)',
|
||||
'CREATE INDEX saga_uuid IF NOT EXISTS FOR (n:Saga) ON (n.uuid)',
|
||||
'CREATE INDEX relation_uuid IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.uuid)',
|
||||
'CREATE INDEX mention_uuid IF NOT EXISTS FOR ()-[e:MENTIONS]-() ON (e.uuid)',
|
||||
'CREATE INDEX has_member_uuid IF NOT EXISTS FOR ()-[e:HAS_MEMBER]-() ON (e.uuid)',
|
||||
'CREATE INDEX has_episode_uuid IF NOT EXISTS FOR ()-[e:HAS_EPISODE]-() ON (e.uuid)',
|
||||
'CREATE INDEX next_episode_uuid IF NOT EXISTS FOR ()-[e:NEXT_EPISODE]-() ON (e.uuid)',
|
||||
'CREATE INDEX entity_group_id IF NOT EXISTS FOR (n:Entity) ON (n.group_id)',
|
||||
'CREATE INDEX episode_group_id IF NOT EXISTS FOR (n:Episodic) ON (n.group_id)',
|
||||
'CREATE INDEX community_group_id IF NOT EXISTS FOR (n:Community) ON (n.group_id)',
|
||||
'CREATE INDEX saga_group_id IF NOT EXISTS FOR (n:Saga) ON (n.group_id)',
|
||||
'CREATE INDEX relation_group_id IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.group_id)',
|
||||
'CREATE INDEX mention_group_id IF NOT EXISTS FOR ()-[e:MENTIONS]-() ON (e.group_id)',
|
||||
'CREATE INDEX has_episode_group_id IF NOT EXISTS FOR ()-[e:HAS_EPISODE]-() ON (e.group_id)',
|
||||
'CREATE INDEX next_episode_group_id IF NOT EXISTS FOR ()-[e:NEXT_EPISODE]-() ON (e.group_id)',
|
||||
'CREATE INDEX name_entity_index IF NOT EXISTS FOR (n:Entity) ON (n.name)',
|
||||
'CREATE INDEX saga_name IF NOT EXISTS FOR (n:Saga) ON (n.name)',
|
||||
'CREATE INDEX created_at_entity_index IF NOT EXISTS FOR (n:Entity) ON (n.created_at)',
|
||||
'CREATE INDEX created_at_episodic_index IF NOT EXISTS FOR (n:Episodic) ON (n.created_at)',
|
||||
'CREATE INDEX valid_at_episodic_index IF NOT EXISTS FOR (n:Episodic) ON (n.valid_at)',
|
||||
'CREATE INDEX name_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.name)',
|
||||
'CREATE INDEX created_at_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.created_at)',
|
||||
'CREATE INDEX expired_at_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.expired_at)',
|
||||
'CREATE INDEX valid_at_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.valid_at)',
|
||||
'CREATE INDEX invalid_at_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.invalid_at)',
|
||||
]
|
||||
|
||||
|
||||
def get_fulltext_indices(provider: GraphProvider) -> list[LiteralString]:
|
||||
if provider == GraphProvider.FALKORDB:
|
||||
from typing import cast
|
||||
|
||||
from graphiti_core.driver.falkordb import STOPWORDS
|
||||
|
||||
# Convert to string representation for embedding in queries
|
||||
stopwords_str = str(STOPWORDS)
|
||||
|
||||
# Use cast() to satisfy the LiteralString requirement while keeping STOPWORDS as the single source of truth
|
||||
return cast(
|
||||
list[LiteralString],
|
||||
[
|
||||
f"""CALL db.idx.fulltext.createNodeIndex(
|
||||
{{
|
||||
label: 'Episodic',
|
||||
stopwords: {stopwords_str}
|
||||
}},
|
||||
'content', 'source', 'source_description', 'group_id'
|
||||
)""",
|
||||
f"""CALL db.idx.fulltext.createNodeIndex(
|
||||
{{
|
||||
label: 'Entity',
|
||||
stopwords: {stopwords_str}
|
||||
}},
|
||||
'name', 'summary', 'group_id'
|
||||
)""",
|
||||
f"""CALL db.idx.fulltext.createNodeIndex(
|
||||
{{
|
||||
label: 'Community',
|
||||
stopwords: {stopwords_str}
|
||||
}},
|
||||
'name', 'group_id'
|
||||
)""",
|
||||
"""CREATE FULLTEXT INDEX FOR ()-[e:RELATES_TO]-() ON (e.name, e.fact, e.group_id)""",
|
||||
],
|
||||
)
|
||||
|
||||
if provider == GraphProvider.KUZU:
|
||||
return [
|
||||
"CALL CREATE_FTS_INDEX('Episodic', 'episode_content', ['content', 'source', 'source_description']);",
|
||||
"CALL CREATE_FTS_INDEX('Entity', 'node_name_and_summary', ['name', 'summary']);",
|
||||
"CALL CREATE_FTS_INDEX('Community', 'community_name', ['name']);",
|
||||
"CALL CREATE_FTS_INDEX('RelatesToNode_', 'edge_name_and_fact', ['name', 'fact']);",
|
||||
]
|
||||
|
||||
return [
|
||||
"""CREATE FULLTEXT INDEX episode_content IF NOT EXISTS
|
||||
FOR (e:Episodic) ON EACH [e.content, e.source, e.source_description, e.group_id]""",
|
||||
"""CREATE FULLTEXT INDEX node_name_and_summary IF NOT EXISTS
|
||||
FOR (n:Entity) ON EACH [n.name, n.summary, n.group_id]""",
|
||||
"""CREATE FULLTEXT INDEX community_name IF NOT EXISTS
|
||||
FOR (n:Community) ON EACH [n.name, n.group_id]""",
|
||||
"""CREATE FULLTEXT INDEX edge_name_and_fact IF NOT EXISTS
|
||||
FOR ()-[e:RELATES_TO]-() ON EACH [e.name, e.fact, e.group_id]""",
|
||||
]
|
||||
|
||||
|
||||
def get_vector_indices(provider: GraphProvider, dimension: int = 384) -> list[LiteralString]:
|
||||
"""Return CREATE VECTOR INDEX statements for the given provider.
|
||||
|
||||
For FalkorDB: creates HNSW vector indexes on Entity.name_embedding,
|
||||
RELATES_TO.fact_embedding, and Community.name_embedding. Backed by
|
||||
FalkorDB's native vector index (db.idx.vector.queryNodes /
|
||||
queryRelationships).
|
||||
|
||||
For Neo4j and Kuzu: returns an empty list. This module's queries for those
backends compute similarity directly (vector.similarity.cosine on Neo4j,
array_cosine_similarity on Kuzu), so graphiti-core's usage does not require
pre-built vector indexes there.
|
||||
|
||||
Args:
|
||||
provider: The graph database provider.
|
||||
dimension: Embedding dimension. Defaults to 384 (all-MiniLM-L6-v2).
|
||||
Embedders with different dimensions should pass their own value
|
||||
through driver configuration. graphiti-core's default embedder
|
||||
is 1536 (OpenAI ada-002); BirdAI uses 384 (sentence-transformers).
|
||||
|
||||
Returns:
|
||||
List of CREATE VECTOR INDEX statements. Idempotent at FalkorDB level
|
||||
if the index already exists with matching options.
|
||||
"""
|
||||
if provider == GraphProvider.FALKORDB:
|
||||
from typing import cast
|
||||
return cast(
|
||||
list[LiteralString],
|
||||
[
|
||||
f"CREATE VECTOR INDEX FOR (n:Entity) ON (n.name_embedding) "
|
||||
f"OPTIONS {{dimension: {dimension}, similarityFunction: 'cosine'}}",
|
||||
f"CREATE VECTOR INDEX FOR ()-[e:RELATES_TO]-() ON (e.fact_embedding) "
|
||||
f"OPTIONS {{dimension: {dimension}, similarityFunction: 'cosine'}}",
|
||||
f"CREATE VECTOR INDEX FOR (n:Community) ON (n.name_embedding) "
|
||||
f"OPTIONS {{dimension: {dimension}, similarityFunction: 'cosine'}}",
|
||||
],
|
||||
)
|
||||
|
||||
return []
|
||||
|
||||
|
||||
def get_nodes_query(name: str, query: str, limit: int, provider: GraphProvider) -> str:
|
||||
if provider == GraphProvider.FALKORDB:
|
||||
label = NEO4J_TO_FALKORDB_MAPPING[name]
|
||||
return f"CALL db.idx.fulltext.queryNodes('{label}', {query})"
|
||||
|
||||
if provider == GraphProvider.KUZU:
|
||||
label = INDEX_TO_LABEL_KUZU_MAPPING[name]
|
||||
return f"CALL QUERY_FTS_INDEX('{label}', '{name}', {query}, TOP := $limit)"
|
||||
|
||||
return f'CALL db.index.fulltext.queryNodes("{name}", {query}, {{limit: $limit}})'
|
||||
|
||||
|
||||
def get_vector_cosine_func_query(vec1, vec2, provider: GraphProvider) -> str:
|
||||
"""Return a Cypher fragment for cosine similarity score in [0, 1].
|
||||
|
||||
PRESERVED for backward compatibility and as fallback when vector indexes
|
||||
do not yet exist on the FalkorDB backend. New code paths should prefer
|
||||
get_vector_search_query() which uses the native vector index when
|
||||
available.
|
||||
"""
|
||||
if provider == GraphProvider.FALKORDB:
|
||||
# FalkorDB exposes cosine *distance*; rescale it to the normalized [0, 1] similarity that Neo4j's vector.similarity.cosine returns
|
||||
return f'(2 - vec.cosineDistance({vec1}, vecf32({vec2})))/2'
|
||||
|
||||
if provider == GraphProvider.KUZU:
|
||||
return f'array_cosine_similarity({vec1}, {vec2})'
|
||||
|
||||
return f'vector.similarity.cosine({vec1}, {vec2})'
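The FalkorDB branch rescales a distance into a similarity. The sketch below shows that arithmetic in plain Python, under the assumption that `vec.cosineDistance` returns `1 - cos(a, b)`, which ranges over [0, 2]. The helper names are hypothetical.

```python
import math


def cosine(a, b):
    """Raw cosine similarity in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cosine_distance(a, b):
    # Assumption: this matches what vec.cosineDistance computes.
    return 1 - cosine(a, b)


def normalized_similarity(a, b):
    # Same expression as the FalkorDB fragment: (2 - distance) / 2 == (1 + cos) / 2.
    return (2 - cosine_distance(a, b)) / 2


print(normalized_similarity([1.0, 0.0], [1.0, 0.0]))   # identical direction
print(normalized_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(normalized_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Under that assumption, identical vectors map to 1, orthogonal vectors to 0.5, and opposite vectors to 0. This is the range the docstring promises.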
|
||||
|
||||
|
||||
def get_relationships_query(name: str, limit: int, provider: GraphProvider) -> str:
|
||||
if provider == GraphProvider.FALKORDB:
|
||||
label = NEO4J_TO_FALKORDB_MAPPING[name]
|
||||
return f"CALL db.idx.fulltext.queryRelationships('{label}', $query)"
|
||||
|
||||
if provider == GraphProvider.KUZU:
|
||||
label = INDEX_TO_LABEL_KUZU_MAPPING[name]
|
||||
return f"CALL QUERY_FTS_INDEX('{label}', '{name}', cast($query AS STRING), TOP := $limit)"
|
||||
|
||||
return f'CALL db.index.fulltext.queryRelationships("{name}", $query, {{limit: $limit}})'
|
||||
+639
-98
@@ -1,12 +1,14 @@
|
||||
import os
|
||||
import re
|
||||
import json
|
||||
import sqlite3
|
||||
import subprocess
|
||||
import hashlib
|
||||
import requests
|
||||
from pathlib import Path
|
||||
from datetime import datetime
|
||||
from datetime import datetime, timedelta
|
||||
from dotenv import load_dotenv
|
||||
from sentence_transformers import SentenceTransformer
|
||||
from sentence_transformers import SentenceTransformer, CrossEncoder
|
||||
import anthropic
|
||||
from fastapi import FastAPI, Request, Response, Depends, HTTPException, BackgroundTasks
|
||||
import psycopg2
|
||||
@@ -38,6 +40,19 @@ load_dotenv(Path.home() / "aaronai" / ".env")
|
||||
|
||||
MEMORY_PATH = Path.home() / "aaronai" / "memory.md"
|
||||
CONVERSATIONS_DB = str(Path.home() / "aaronai" / "conversations.db")
|
||||
|
||||
def _connect(path):
|
||||
conn = sqlite3.connect(path, timeout=5.0)
|
||||
conn.execute("PRAGMA synchronous=NORMAL")
|
||||
conn.execute("PRAGMA foreign_keys=ON")
|
||||
return conn
|
||||
|
||||
def _connect_conversations():
|
||||
return _connect(CONVERSATIONS_DB)
|
||||
|
||||
def _connect_sessions():
|
||||
return _connect(SESSIONS_DB)
|
||||
|
||||
SETTINGS_PATH = Path.home() / "aaronai" / "settings.json"
|
||||
WATCHER_LOG = str(Path.home() / "aaronai" / "watcher.log")
|
||||
WATCHER_STATE = str(Path.home() / "aaronai" / "watcher_state.json")
|
||||
@@ -73,11 +88,12 @@ WHISPER_PROMPT = (
|
||||
whisper_model = None
|
||||
if HAS_WHISPER:
|
||||
try:
|
||||
whisper_model = WhisperModel("large-v3", device="cpu", compute_type="int8", cpu_threads=8)
|
||||
whisper_model = WhisperModel("distil-large-v3", device="cpu", compute_type="int8", cpu_threads=4)
|
||||
print("Whisper model loaded")
|
||||
except Exception as e:
|
||||
print(f"Whisper not available: {e}")
|
||||
embedder = SentenceTransformer("all-MiniLM-L6-v2")
|
||||
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
|
||||
# ChromaDB removed — using pgvector
|
||||
anthropic_client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
|
||||
|
||||
@@ -108,22 +124,65 @@ economical, specific, never performative. When answering questions,
|
||||
cite sources and acknowledge uncertainty rather than filling gaps with
|
||||
plausible-sounding content.
|
||||
|
||||
You have access to his complete document corpus, conversation history,
|
||||
and a persistent memory file that carries his current context. Treat
|
||||
the memory file as ground truth for his present situation. Use web
|
||||
search automatically when current information is needed. Never
|
||||
re-brief on context that's already in memory or documents.
|
||||
You have a persistent memory file (always present below) that carries
|
||||
Aaron's current context — treat it as ground truth for his present
|
||||
situation.
|
||||
|
||||
For anything beyond what's in memory, you have a retrieve_documents
|
||||
tool that searches his full knowledge base: personal documents,
|
||||
reading library, conversation transcripts, and journal entries. Call
|
||||
it whenever you need concrete information — names, dates, project
|
||||
specifics, prior thinking, exhibition records, syllabi, anything you
|
||||
don't already know. For compound questions, call it multiple times
|
||||
with different concrete queries; one call per distinct information
|
||||
need. Prefer specific tokens (named entities, project names, course
|
||||
codes) over abstract instructional phrasing — search "FWN3D
|
||||
consulting" not "my work." Results are unfiltered and ranked by
|
||||
semantic similarity; judge each chunk for relevance and ignore
|
||||
irrelevant hits rather than forcing them into the answer.
|
||||
|
||||
You also have a search_facts tool that queries a knowledge graph of
|
||||
atomic facts about Aaron's entities and their relationships. The graph
|
||||
was populated through early May 2026 and is not currently being
|
||||
updated; treat it as a *historical* layer that holds biographical
|
||||
content (career, projects, consulting), exhibition records, key
|
||||
people, dossier-era claims, and time-stamped facts with explicit
|
||||
validity windows. For biographical or relational questions ("write
|
||||
me a bio", "what's the FWN3D / HVAMC relationship", "who did I
|
||||
consult for at IBM"), call search_facts *in addition to*
|
||||
retrieve_documents — the two return complementary shapes (atomic
|
||||
facts vs. document passages). For current-state questions, the
|
||||
persistent memory file is more authoritative than the graph.
|
||||
|
||||
When Aaron asks for a document file — bio, cover letter, statement,
|
||||
CV section, anything he wants to send or edit outside chat — produce
|
||||
the full text as your chat reply first. NEVER call save_document on
|
||||
the same turn as the initial request, even when Aaron's phrasing
|
||||
includes words like "save", "output", "write", or "as docx/pdf" in
|
||||
the original ask. Those are part of the topic, not a save approval.
|
||||
The first call to save_document only happens in a *later* turn,
|
||||
after Aaron has read the draft and explicitly approves it — examples:
|
||||
"save it", "yes save it", "looks good, write it out", "go ahead".
|
||||
If Aaron asks for revisions, iterate in chat without calling
|
||||
save_document. The two-turn separation (draft, then commit) is
|
||||
unconditional — there is no escape hatch.
|
||||
|
||||
Use web search automatically when current external information is
|
||||
needed. Never re-brief on context that's already in memory or
|
||||
retrieved chunks.
|
||||
|
||||
When making factual claims about Aaron — his history, credentials, locations, dates, relationships, projects, or any specific event — you must ground the claim in a specific retrieved document or the memory file. Cite the source by name inline. If no source supports the claim, say so explicitly rather than filling the gap with plausible-sounding content. Do not confabulate. If you are inferring rather than citing, mark it as inference."""
|
||||
|
||||
# Auth configuration
|
||||
import os
|
||||
SESSION_PASSWORD = os.getenv("AARON_AI_PASSWORD", "changeme")
|
||||
SESSION_MAX_AGE_SECONDS = 60 * 60 * 24 * 365
|
||||
SESSIONS_DB = str(Path.home() / "aaronai" / "sessions.db")
|
||||
|
||||
def _init_sessions():
|
||||
conn = sqlite3.connect(SESSIONS_DB)
|
||||
conn = _connect_sessions()
|
||||
conn.execute("CREATE TABLE IF NOT EXISTS sessions (token TEXT PRIMARY KEY, created_at TEXT)")
|
||||
conn.execute("PRAGMA journal_mode=WAL")
|
||||
conn.commit()
|
||||
conn.close()
|
||||
|
||||
@@ -136,20 +195,23 @@ def hash_password(password: str) -> str:
|
||||
return hashlib.sha256(password.encode()).hexdigest()
|
||||
|
||||
def save_session(token: str):
|
||||
conn = sqlite3.connect(SESSIONS_DB)
|
||||
conn = _connect_sessions()
|
||||
conn.execute("INSERT OR REPLACE INTO sessions VALUES (?, ?)", (token, datetime.now().isoformat()))
|
||||
conn.commit()
|
||||
conn.close()
|
||||
|
||||
def delete_session(token: str):
|
||||
conn = sqlite3.connect(SESSIONS_DB)
|
||||
conn = _connect_sessions()
|
||||
conn.execute("DELETE FROM sessions WHERE token = ?", (token,))
|
||||
conn.commit()
|
||||
conn.close()
|
||||
|
||||
def session_exists(token: str) -> bool:
|
||||
conn = sqlite3.connect(SESSIONS_DB)
|
||||
row = conn.execute("SELECT 1 FROM sessions WHERE token = ?", (token,)).fetchone()
|
||||
conn = _connect_sessions()
|
||||
cutoff = (datetime.now() - timedelta(seconds=SESSION_MAX_AGE_SECONDS)).isoformat()
|
||||
conn.execute("DELETE FROM sessions WHERE created_at < ?", (cutoff,))
|
||||
conn.commit()
|
||||
row = conn.execute("SELECT 1 FROM sessions WHERE token = ? AND created_at >= ?", (token, cutoff)).fetchone()
|
||||
conn.close()
|
||||
return row is not None
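The expiry logic in `session_exists` relies on ISO-8601 timestamps sorting lexicographically in chronological order. A minimal sketch against an in-memory SQLite database follows. The explicit `conn` and `now` parameters are additions for testability and are not part of the original signature.

```python
import sqlite3
from datetime import datetime, timedelta

SESSION_MAX_AGE_SECONDS = 60 * 60 * 24 * 365


def session_exists_sketch(conn, token, now=None):
    """Purge expired rows, then check the token against the same cutoff."""
    now = now or datetime.now()
    cutoff = (now - timedelta(seconds=SESSION_MAX_AGE_SECONDS)).isoformat()
    # String comparison is safe because isoformat() output sorts chronologically.
    conn.execute("DELETE FROM sessions WHERE created_at < ?", (cutoff,))
    conn.commit()
    row = conn.execute(
        "SELECT 1 FROM sessions WHERE token = ? AND created_at >= ?", (token, cutoff)
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (token TEXT PRIMARY KEY, created_at TEXT)")
now = datetime(2026, 5, 3, 12, 0, 0)
conn.execute("INSERT INTO sessions VALUES (?, ?)", ("fresh", (now - timedelta(days=1)).isoformat()))
conn.execute("INSERT INTO sessions VALUES (?, ?)", ("stale", (now - timedelta(days=400)).isoformat()))

fresh_ok = session_exists_sketch(conn, "fresh", now)
stale_ok = session_exists_sketch(conn, "stale", now)
remaining = conn.execute("SELECT COUNT(*) FROM sessions").fetchone()[0]
print(fresh_ok, stale_ok, remaining)
```

In this sketch, as in the original, every lookup also acts as garbage collection, so no separate cleanup job is needed. The cost is one extra write per authentication check.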
|
||||
|
||||
@@ -163,7 +225,7 @@ def require_auth(request: Request):
|
||||
return token
|
||||
|
||||
def init_conversations_db():
|
||||
conn = sqlite3.connect(CONVERSATIONS_DB)
|
||||
conn = _connect_conversations()
|
||||
c = conn.cursor()
|
||||
c.execute('''CREATE TABLE IF NOT EXISTS conversations (
|
||||
id TEXT PRIMARY KEY,
|
||||
@@ -182,6 +244,8 @@ def init_conversations_db():
|
||||
timestamp TEXT NOT NULL,
|
||||
FOREIGN KEY (conversation_id) REFERENCES conversations(id)
|
||||
)''')
|
||||
c.execute("PRAGMA journal_mode=WAL")
|
||||
c.execute("CREATE INDEX IF NOT EXISTS idx_messages_conv_ts ON messages(conversation_id, timestamp DESC)")
|
||||
conn.commit()
|
||||
conn.close()
|
||||
|
||||
@@ -223,34 +287,131 @@ def remove_from_memory(item):
|
||||
save_memory("\n".join(filtered))
|
||||
return len(lines) - len(filtered)
|
||||
|
||||
def retrieve_context(query, n_results=8):
|
||||
"""Pure semantic retrieval over pgvector. Top-N by cosine similarity, threshold 0.3.
|
||||
No CV pinning, no keyword routing — see architecture doc substrate-dependency section.
|
||||
Substrate-level workarounds (entity-keyed routing, hybrid retrieval) live at the
|
||||
Graphiti layer, not as wrapper logic above pgvector."""
|
||||
HYBRID_CANDIDATES = 30
|
||||
RRF_K = 60
|
||||
FINAL_LIMIT = 8
|
||||
MAX_RETRIEVALS_PER_TURN = 5
|
||||
MAX_CITED_SOURCES = 5
|
||||
|
||||
_TSQUERY_SANITIZE_RE = re.compile(r"[^\w\s\"'-]")
|
||||
|
||||
|
||||
def _websearch_query(text: str) -> str:
|
||||
"""Strip characters websearch_to_tsquery doesn't handle cleanly. Quoted
|
||||
phrases and 'or' are preserved by the function itself."""
|
||||
return _TSQUERY_SANITIZE_RE.sub(" ", text).strip()
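As a quick illustration of what the sanitizer keeps and drops, the sketch below reuses the same pattern on sample inputs. The inputs are hypothetical, and the helper name is changed only to keep the example standalone.

```python
import re

# Same character class as above: keep word characters, whitespace, quotes, and hyphens.
_TSQUERY_SANITIZE_RE = re.compile(r"[^\w\s\"'-]")


def websearch_query_sketch(text: str) -> str:
    return _TSQUERY_SANITIZE_RE.sub(" ", text).strip()


cleaned = websearch_query_sketch('FWN3D "solo show" (2024)!')
hyphenated = websearch_query_sketch("Sono-Tek: coatings?")
empty = websearch_query_sketch("!!! ???")
print(repr(cleaned), repr(hyphenated), repr(empty))
```

Quoted phrases and hyphenated tokens survive, while brackets and punctuation become spaces. A query made only of punctuation collapses to an empty string, which is why the caller checks `if ts_query:` before running the lexical search.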
|
||||
|
||||
|
||||
def _rerank(query: str, candidates: list[tuple]) -> list[tuple]:
|
||||
"""Cross-encoder rerank. Candidates are (id, document, source, folder, created_at)
|
||||
tuples. Returns the same tuples reordered by reranker score with created_at as
|
||||
secondary key — so when two chunks score similarly the newer one wins, which
|
||||
keeps memory/journal files biased toward the latest snapshot."""
|
||||
if not candidates:
|
||||
return []
|
||||
pairs = [(query, row[1]) for row in candidates]
|
||||
scores = reranker.predict(pairs)
|
||||
return [row for row, _ in sorted(
|
||||
zip(candidates, scores),
|
||||
key=lambda x: (float(x[1]), x[0][4] or ""),
|
||||
reverse=True,
|
||||
)]
|
||||
|
||||
|
||||
def _format_source(source: str, folder: str) -> str:
|
||||
"""Surface folder context to the LLM so it can disambiguate same-named files
|
||||
(e.g., 21 different CV.docx files across job-application folders)."""
|
||||
source = source or "unknown"
|
||||
if folder and folder not in ("", "."):
|
||||
return f"{folder}/{source}"
|
||||
return source
|
||||
|
||||
|
||||
def _dedup_key(doc: str) -> str:
|
||||
"""Collapse near-duplicates by content. Files copied to multiple folders
|
||||
produce byte-identical chunks; this catches those without affecting
|
||||
legitimately-different chunks of the same source (e.g., separate sections
|
||||
of a conversation)."""
|
||||
return hashlib.md5(doc[:300].lower().encode("utf-8", "ignore")).hexdigest()
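The behavior of the dedup key is easiest to see on a few sample strings. The sketch below is an illustration with hypothetical inputs and a hypothetical `collapse` helper. The hashing expression is the one used above.

```python
import hashlib


def dedup_key(doc: str) -> str:
    # Hash only the first 300 characters, case-insensitively.
    return hashlib.md5(doc[:300].lower().encode("utf-8", "ignore")).hexdigest()


def collapse(chunks):
    """Keep the first chunk seen for each key, preserving order."""
    seen, kept = set(), []
    for chunk in chunks:
        key = dedup_key(chunk)
        if key not in seen:
            seen.add(key)
            kept.append(chunk)
    return kept


copy_a = "Curriculum Vitae\nEducation and exhibitions"
copy_b = copy_a.upper()                      # same text, different case
other = "Syllabus for an introductory course"
long_one = "p" * 300 + " first ending"
long_two = "p" * 300 + " second ending"      # differs only after character 300

print(len(collapse([copy_a, copy_b, other])))
print(dedup_key(long_one) == dedup_key(long_two))
```

One consequence of hashing a 300-character prefix is that two chunks sharing an identical opening but diverging later are treated as duplicates. That is cheap and catches multi-folder copies, but it is a prefix match rather than a full-content comparison.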
|
||||
|
||||
|
||||
def retrieve_context(query, n_results=FINAL_LIMIT):
|
||||
"""Hybrid retrieval (dense + lexical, RRF fused) followed by cross-encoder rerank.
|
||||
|
||||
- Dense (pgvector) handles paraphrase / semantic similarity.
|
||||
- Lexical (tsvector) catches rare named tokens (FWN3D, Sono-Tek, course codes)
|
||||
the embedding model has no signal for.
|
||||
- RRF combines the two rankings without calibrating score scales.
|
||||
- Cross-encoder rerank scores each (query, chunk) pair jointly.
|
||||
- Near-duplicate collapse on output so top-N slots aren't burned by
|
||||
multi-folder copies of the same file.
|
||||
|
||||
No type or folder filtering: imposing a taxonomy at retrieval time is a
|
||||
heuristic we've explicitly rejected. The reranker ranks, the caller (LLM)
|
||||
decides what's relevant to its task."""
|
||||
query_embedding = embedder.encode([query]).tolist()[0]
|
||||
ts_query = _websearch_query(query)
|
||||
|
||||
context_pieces = []
|
||||
sources = []
|
||||
|
||||
try:
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
|
||||
         cur.execute("""
-            SELECT document, source, 1 - (embedding <=> %s::vector) as similarity
+            SELECT id, document, source, metadata->>'folder' AS folder, created_at
             FROM embeddings
             ORDER BY embedding <=> %s::vector
             LIMIT %s
-        """, (query_embedding, query_embedding, n_results))
-        for doc, source, similarity in cur.fetchall():
-            if similarity > 0.3:
-                context_pieces.append(doc)
-                sources.append(source or "unknown")
+        """, (query_embedding, HYBRID_CANDIDATES))
|
||||
dense_hits = cur.fetchall()
|
||||
|
||||
lexical_hits = []
|
||||
if ts_query:
|
||||
cur.execute("""
|
||||
SELECT id, document, source, metadata->>'folder' AS folder, created_at
|
||||
FROM embeddings
|
||||
WHERE to_tsvector('english', document)
|
||||
@@ websearch_to_tsquery('english', %s)
|
||||
ORDER BY ts_rank(to_tsvector('english', document),
|
||||
websearch_to_tsquery('english', %s)) DESC
|
||||
LIMIT %s
|
||||
""", (ts_query, ts_query, HYBRID_CANDIDATES))
|
||||
lexical_hits = cur.fetchall()
|
||||
|
||||
pg.close()
|
||||
|
||||
scores = {}
|
||||
rows_by_id = {}
|
||||
for rank, row in enumerate(dense_hits):
|
||||
scores[row[0]] = scores.get(row[0], 0) + 1.0 / (RRF_K + rank + 1)
|
||||
rows_by_id[row[0]] = row
|
||||
for rank, row in enumerate(lexical_hits):
|
||||
scores[row[0]] = scores.get(row[0], 0) + 1.0 / (RRF_K + rank + 1)
|
||||
rows_by_id[row[0]] = row
|
||||
|
||||
rrf_ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
|
||||
candidates = [rows_by_id[doc_id] for doc_id, _ in rrf_ranked]
|
||||
|
||||
seen = set()
|
||||
for _id, doc, source, folder, _created_at in _rerank(query, candidates):
|
||||
key = _dedup_key(doc)
|
||||
if key in seen:
|
||||
continue
|
||||
seen.add(key)
|
||||
context_pieces.append(doc)
|
||||
sources.append(_format_source(source, folder))
|
||||
if len(context_pieces) >= n_results:
|
||||
break
|
||||
|
||||
except Exception as e:
|
||||
print(f"pgvector retrieval error: {e}")
|
||||
print(f"hybrid retrieval error: {e}")
|
||||
|
||||
return context_pieces, sources
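The fusion step inside `retrieve_context` is reciprocal rank fusion. A minimal sketch with two hand-made rankings shows how it behaves. The document ids are hypothetical, and the function name `rrf_fuse` is introduced here for illustration.

```python
RRF_K = 60  # same constant as above


def rrf_fuse(rankings, k=RRF_K):
    """Fuse several best-first id lists; each list adds 1 / (k + rank + 1) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


dense = ["a", "b", "c"]     # e.g. pgvector order
lexical = ["c", "a", "d"]   # e.g. ts_rank order
fused = rrf_fuse([dense, lexical])
print(fused)
```

Ids that appear in both lists ("a" and "c") rise above ids found by only one retriever. Because only ranks are used, the cosine distances and `ts_rank` values never need to be put on a common scale.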
|
||||
|
||||
def get_conversation_history(conversation_id, limit=20):
|
||||
conn = sqlite3.connect(CONVERSATIONS_DB)
|
||||
conn = _connect_conversations()
|
||||
c = conn.cursor()
|
||||
c.execute('''SELECT role, content FROM messages
|
||||
WHERE conversation_id = ?
|
||||
@@ -260,7 +421,7 @@ def get_conversation_history(conversation_id, limit=20):
|
||||
return [{"role": r[0], "content": r[1]} for r in reversed(rows)]
|
||||
|
||||
def save_message(conversation_id, role, content, sources=None):
|
||||
conn = sqlite3.connect(CONVERSATIONS_DB)
|
||||
conn = _connect_conversations()
|
||||
c = conn.cursor()
|
||||
msg_id = hashlib.md5(f"{conversation_id}{role}{datetime.now().isoformat()}".encode()).hexdigest()
|
||||
timestamp = datetime.now().isoformat()
|
||||
@@ -274,7 +435,7 @@ def save_message(conversation_id, role, content, sources=None):
|
||||
conn.close()
|
||||
|
||||
def create_conversation(title="New conversation"):
|
||||
conn = sqlite3.connect(CONVERSATIONS_DB)
|
||||
conn = _connect_conversations()
|
||||
c = conn.cursor()
|
||||
conv_id = hashlib.md5(f"{datetime.now().isoformat()}".encode()).hexdigest()[:16]
|
||||
now = datetime.now().isoformat()
|
||||
@@ -284,50 +445,370 @@ def create_conversation(title="New conversation"):
|
||||
conn.close()
|
||||
return conv_id
|
||||
|
||||
NEXTCLOUD_URL = os.getenv("NEXTCLOUD_URL", "https://nextcloud.aaronnelson.studio")
|
||||
NEXTCLOUD_USER = os.getenv("NEXTCLOUD_USER", "aaron")
|
||||
NEXTCLOUD_PASSWORD = os.getenv("NEXTCLOUD_PASSWORD", "")
|
||||
DRAFTS_WEBDAV = f"{NEXTCLOUD_URL}/remote.php/dav/files/{NEXTCLOUD_USER}/Drafts"
|
||||
|
||||
_FILENAME_SAFE_RE = re.compile(r"[^A-Za-z0-9_\-\. ]")
|
||||
|
||||
|
||||
GRAPHITI_URL = os.getenv("GRAPHITI_URL", "http://localhost:8001")
|
||||
GRAPHITI_GROUP_ID = os.getenv("GRAPHITI_GROUP_ID", "aaron")
|
||||
|
||||
|
||||
SEARCH_FACTS_TOOL = {
|
||||
"name": "search_facts",
|
||||
"description": (
|
||||
"Search Aaron's knowledge graph for atomic facts about entities and "
|
||||
"their relationships. The graph holds time-stamped facts captured up "
|
||||
"to early May 2026 — biographical content (career, projects, "
|
||||
"consulting), exhibition history, key relationships, dossier-era "
|
||||
"claims. Returns short sentence-shaped facts with valid_at / "
|
||||
"invalid_at timestamps so you can distinguish current state from "
|
||||
"superseded history. Useful for: bios, 'who did I consult for', "
|
||||
"'what's the relationship between X and Y', any question shaped like "
|
||||
"a relational lookup. Complements retrieve_documents (which returns "
|
||||
"longer chunk passages). Call this *in addition to* retrieve_documents "
|
||||
"for biographical or relational questions — the two return "
|
||||
"different shapes of evidence. The graph hasn't been updated since "
|
||||
"early May 2026; for current-state questions, the persistent memory "
|
||||
"file or recent documents are more authoritative."
|
||||
),
|
||||
"input_schema": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"query": {
|
||||
"type": "string",
|
||||
"description": "The fact-shaped query. Concrete entity names work best.",
|
||||
},
|
||||
},
|
||||
"required": ["query"],
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def _push_chat_turn_to_graphiti(conversation_id, user_message, assistant_message):
|
||||
"""Async fire-and-forget push of a chat turn into Graphiti. Single episode,
|
||||
default extraction, no custom_extraction_instructions. Takes ~20 min in
|
||||
the background against the current ~4,300-entity graph; the chat caller
|
||||
is not gated on this. Errors are logged, never raised."""
|
||||
if os.getenv("SKIP_GRAPHITI_CHAT_PUSH"):
|
||||
return
|
||||
if not (user_message or "").strip() and not (assistant_message or "").strip():
|
||||
return
|
||||
import threading
|
||||
from datetime import datetime as _dt
|
||||
|
||||
def _work():
|
||||
try:
|
||||
episode_name = f"chat-{conversation_id[:8]}-{_dt.now().strftime('%Y%m%dT%H%M%S')}"
|
||||
content = (
|
||||
f"User: {user_message}\n\n"
|
||||
f"Assistant: {assistant_message}"
|
||||
)
|
||||
payload = {
|
||||
"name": episode_name,
|
||||
"content": content,
|
||||
"source_description": f"chat turn (conversation {conversation_id})",
|
||||
"timestamp": _dt.now().isoformat(),
|
||||
"group_id": GRAPHITI_GROUP_ID,
|
||||
}
|
||||
# Long timeout — sidecar add_episode against the current graph
|
||||
# is empirically ~20 min wall-clock. We're patient; chat isn't.
|
||||
r = requests.post(f"{GRAPHITI_URL}/episodes", json=payload, timeout=1800)
|
||||
if r.status_code == 200:
|
||||
print(f"[graphiti-push] turn ingested: {episode_name}", flush=True)
|
||||
else:
|
||||
print(f"[graphiti-push] non-200 ({r.status_code}) for {episode_name}: {r.text[:200]}", flush=True)
|
||||
except requests.RequestException as e:
|
||||
print(f"[graphiti-push] request failed: {e}", flush=True)
|
||||
except Exception as e:
|
||||
print(f"[graphiti-push] unexpected error: {e}", flush=True)
|
||||
|
||||
threading.Thread(target=_work, daemon=True).start()
|
||||
|
||||
|
||||
def _execute_search_facts(tool_input):
|
||||
"""Hit Graphiti /search, format the results as text for Claude."""
|
||||
query = (tool_input or {}).get("query", "").strip()
|
||||
if not query:
|
||||
return "No query provided."
|
||||
try:
|
||||
r = requests.get(
|
||||
f"{GRAPHITI_URL}/search",
|
||||
params={"query": query, "limit": 8, "group_id": GRAPHITI_GROUP_ID},
|
||||
timeout=15,
|
||||
)
|
||||
except requests.RequestException as e:
|
||||
return f"search_facts: Graphiti unreachable ({e})."
|
||||
if r.status_code != 200:
|
||||
return f"search_facts: Graphiti returned {r.status_code}."
|
||||
results = r.json().get("results", [])
|
||||
if not results:
|
||||
return f"No facts found for {query!r}."
|
||||
lines = []
|
||||
for i, f in enumerate(results, 1):
|
||||
fact = f.get("fact", "").strip()
|
||||
valid_at = f.get("valid_at") or "?"
|
||||
invalid_at = f.get("invalid_at")
|
||||
validity = (f"valid {valid_at}" + (f" → superseded {invalid_at}"
|
||||
if invalid_at and invalid_at != "None" else ""))
|
||||
lines.append(f"[{i}] {fact} ({validity})")
|
||||
return "\n".join(lines)
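The formatting half of `_execute_search_facts` can be exercised without a running Graphiti sidecar. The sketch below applies the same validity logic to hand-written result dicts. The facts are invented placeholders, not real graph content.

```python
def format_facts(results):
    """Render fact dicts the same way the tool handler does."""
    lines = []
    for i, f in enumerate(results, 1):
        fact = f.get("fact", "").strip()
        valid_at = f.get("valid_at") or "?"
        invalid_at = f.get("invalid_at")
        validity = f"valid {valid_at}"
        # Some serializers emit the string "None"; treat it like a missing value.
        if invalid_at and invalid_at != "None":
            validity += f" → superseded {invalid_at}"
        lines.append(f"[{i}] {fact} ({validity})")
    return "\n".join(lines)


# Placeholder facts for illustration only.
sample = [
    {"fact": "Alice works at Example Corp ", "valid_at": "2020-01-01", "invalid_at": "2023-06-30"},
    {"fact": "Alice lives in Springfield", "valid_at": None, "invalid_at": "None"},
]
rendered = format_facts(sample)
print(rendered)
```

Numbering each fact and attaching its validity window inline gives the model a compact way to tell current facts from superseded ones.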
|
||||
|
||||
|
||||
SAVE_DOCUMENT_TOOL = {
|
||||
"name": "save_document",
|
||||
"description": (
|
||||
"Render markdown content to docx or pdf and save it to Aaron's Nextcloud "
|
||||
"Drafts/ folder (syncs to his other devices and web UI). Use this when "
|
||||
"Aaron asks for a document file rather than chat text — bios, cover "
|
||||
"letters, statements, CV sections, anything he'll edit or send. Returns "
|
||||
"the saved filename. Pick a descriptive filename (no extension) like "
|
||||
"'Aaron_Nelson_Bio_Utah_2026-05'. Format is 'docx' for editable drafts, "
|
||||
"'pdf' for typeset/print-ready output. Content should be well-formed "
|
||||
"markdown — # headings, **bold**, *italic*, - bulleted lists. Don't "
|
||||
"embed file content in the chat response too; just call this tool and "
|
||||
"tell Aaron where it landed."
|
||||
),
|
||||
"input_schema": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"content": {
|
||||
"type": "string",
|
||||
"description": "Document content in markdown.",
|
||||
},
|
||||
"filename": {
|
||||
"type": "string",
|
||||
"description": "Descriptive filename without extension.",
|
||||
},
|
||||
"format": {
|
||||
"type": "string",
|
||||
"enum": ["docx", "pdf"],
|
||||
"description": "Output format.",
|
||||
},
|
||||
},
|
||||
"required": ["content", "filename", "format"],
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def _safe_filename(name: str, ext: str) -> str:
|
||||
"""Strip path components and unsafe chars; force the requested extension."""
|
||||
base = Path(name).name
|
||||
base = _FILENAME_SAFE_RE.sub("_", base).strip().rstrip(".")
|
||||
if not base:
|
||||
base = "untitled"
|
||||
base = Path(base).stem
|
||||
return f"{base}.{ext}"
|
||||
|
||||
|
||||
def _webdav_unique_url(base_url: str, filename: str, auth) -> tuple[str, str]:
|
||||
"""Return a WebDAV URL that doesn't collide with an existing file. Appends
|
||||
_2, _3, ... until PROPFIND returns 404. Matches the convention dream.py uses."""
|
||||
stem = Path(filename).stem
|
||||
suffix = Path(filename).suffix
|
||||
name = filename
|
||||
i = 2
|
||||
while True:
|
||||
url = f"{base_url}/{name}"
|
||||
check = requests.request("PROPFIND", url, auth=auth, timeout=10)
|
||||
if check.status_code == 404:
|
||||
return url, name
|
||||
name = f"{stem}_{i}{suffix}"
|
||||
i += 1
|
||||
if i > 50:
|
||||
raise RuntimeError("could not find a free filename")
|
||||
|
||||
|
||||
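The suffixing convention in `_webdav_unique_url` (try the plain name, then `_2`, `_3`, ... up to a bounded number of probes) can be exercised without a network. The following is a minimal sketch under stated assumptions: `unique_name`, its `exists` callback, and `max_tries` are hypothetical names introduced for illustration, with `exists` standing in for the PROPFIND probe; it is not the module's actual helper.

```python
from pathlib import Path
from typing import Callable


def unique_name(filename: str, exists: Callable[[str], bool], max_tries: int = 50) -> str:
    """Return `filename`, or the first free `stem_2`, `stem_3`, ... variant.

    `exists` is a hypothetical stand-in for the WebDAV PROPFIND probe.
    """
    stem, suffix = Path(filename).stem, Path(filename).suffix
    candidate = filename
    for i in range(2, max_tries + 2):
        if not exists(candidate):
            return candidate
        # Candidate is taken: move on to the next numbered variant.
        candidate = f"{stem}_{i}{suffix}"
    raise RuntimeError("could not find a free filename")


# Illustrative data: two names are already taken on the "server".
taken = {"bio.docx", "bio_2.docx"}
free_name = unique_name("bio.docx", taken.__contains__)
print(free_name)
```

Injecting the existence check keeps the naming rule testable on its own; a real caller would pass a function that issues the PROPFIND request and treats a 404 as "free".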
def _execute_save_document(tool_input):
    """Generate a document via pandoc and PUT it to Nextcloud Drafts/.
    Returns a user-facing status string for Claude to relay."""
    if not NEXTCLOUD_PASSWORD:
        return "save_document: NEXTCLOUD_PASSWORD not configured."

    payload = tool_input or {}
    content = payload.get("content", "")
    raw_filename = payload.get("filename", "untitled")
    fmt = payload.get("format", "docx")

    if not content.strip():
        return "save_document: empty content, nothing saved."
    if fmt not in ("docx", "pdf"):
        return f"save_document: unsupported format {fmt!r}; use 'docx' or 'pdf'."

    safe_name = _safe_filename(raw_filename, fmt)
    auth = (NEXTCLOUD_USER, NEXTCLOUD_PASSWORD)

    # Ensure Drafts/ exists. 201 = created, 405 = already there — both fine.
    try:
        requests.request("MKCOL", DRAFTS_WEBDAV, auth=auth, timeout=10)
    except requests.RequestException as e:
        return f"save_document: could not reach Nextcloud ({e})."

    try:
        url, final_name = _webdav_unique_url(DRAFTS_WEBDAV, safe_name, auth)
    except (requests.RequestException, RuntimeError) as e:
        return f"save_document: filename probe failed ({e})."

    cmd = ["pandoc", "-f", "markdown", "-t", fmt, "-o", "-"]
    if fmt == "pdf":
        cmd.insert(-2, "--pdf-engine=xelatex")
    try:
        proc = subprocess.run(
            cmd, input=content.encode("utf-8"),
            capture_output=True, timeout=120,
        )
    except subprocess.TimeoutExpired:
        return "save_document: pandoc timed out (>120s)."
    except FileNotFoundError:
        return ("save_document: pandoc binary not reachable from the api process "
                "(check that PATH in aaronai.service includes /usr/bin).")
    if proc.returncode != 0:
        err = proc.stderr.decode("utf-8", errors="replace")[:400]
        return f"save_document: pandoc failed: {err}"

    try:
        put = requests.put(url, data=proc.stdout, auth=auth, timeout=60)
    except requests.RequestException as e:
        return f"save_document: WebDAV upload failed ({e})."
    if put.status_code not in (200, 201, 204):
        return f"save_document: WebDAV upload returned {put.status_code}."

    return f"Saved to Nextcloud: Drafts/{final_name}"


RETRIEVE_DOCUMENTS_TOOL = {
    "name": "retrieve_documents",
    "description": (
        "Search Aaron's knowledge base — personal documents, reading library, "
        "conversation transcripts, and journal entries — for content relevant "
        "to a query. Call whenever you need concrete information you don't "
        "already have from the persistent memory file. For compound questions "
        "(e.g. 'bio emphasizing consulting work and recent research'), call "
        "this tool multiple times with different concrete queries; one call "
        "per distinct information need. Prefer specific named entities, "
        "project names, course codes, or topic-specific terms over abstract "
        "instructional phrasing — 'FWN3D consulting' retrieves better than "
        "'my work'. Results are ranked by semantic + lexical hybrid retrieval "
        "and a cross-encoder reranker; no taxonomy is applied, so judge each "
        "returned chunk on its own merits and ignore irrelevant hits."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "The search query. Use concrete terms.",
            },
        },
        "required": ["query"],
    },
}


def _execute_retrieve_documents(tool_input):
    """Run retrieve_context for a tool call. Returns (tool_result_text, sources)."""
    query = (tool_input or {}).get("query", "").strip()
    if not query:
        return ("No query provided.", [])
    pieces, sources = retrieve_context(query)
    if not pieces:
        return (f"No results for query={query!r}.", [])
    parts = []
    for i, (piece, src) in enumerate(zip(pieces, sources), 1):
        parts.append(f"[{i}] Source: {src}\n{piece}")
    return ("\n\n---\n\n".join(parts), sources)
|
||||
|
||||
def chat(user_message, conversation_id, settings, client_time=None):
|
||||
memory = load_memory()
|
||||
context_pieces, sources = retrieve_context(user_message)
|
||||
history = get_conversation_history(conversation_id)
|
||||
|
||||
context_parts = []
|
||||
if client_time:
|
||||
context_parts.append(f"Current time (user-supplied, not logged): {client_time}")
|
||||
# System prompt + persistent memory are stable across the tool_use round-trip
|
||||
# and across turns within the 5-minute cache TTL. Putting cache_control on the
|
||||
# last system block creates a cache breakpoint here — the second LLM call in a
|
||||
# tool_use turn reads this prefix from cache (~10% of standard input cost)
|
||||
# instead of re-billing it. Memory lives here (not in the user message) so its
|
||||
# position stays stable for cache hits.
|
||||
system_blocks = [{"type": "text", "text": SYSTEM_PROMPT}]
|
||||
if memory:
|
||||
context_parts.append(f"Aaron's persistent memory:\n\n{memory}")
|
||||
if context_pieces:
|
||||
context_str = "\n\n---\n\n".join(context_pieces)
|
||||
unique_sources = list(set(sources))
|
||||
context_parts.append(
|
||||
f"Relevant excerpts from Aaron's documents:\n\n{context_str}\n\nSources: {', '.join(unique_sources)}"
|
||||
system_blocks.append({
|
||||
"type": "text",
|
||||
"text": f"Aaron's persistent memory:\n\n{memory}",
|
||||
})
|
||||
system_blocks[-1]["cache_control"] = {"type": "ephemeral"}
|
||||
|
||||
# client_time is per-turn dynamic, so it stays out of the cached prefix.
|
||||
if client_time:
|
||||
full_message = (
|
||||
f"Current time (user-supplied, not logged): {client_time}\n\n"
|
||||
f"---\n\n{user_message}"
|
||||
)
|
||||
context_block = "\n\n====\n\n".join(context_parts) + "\n\n---\n\n" if context_parts else ""
|
||||
full_message = context_block + user_message
|
||||
else:
|
||||
full_message = user_message
|
||||
|
||||
messages = history + [{"role": "user", "content": full_message}]
|
||||
|
||||
tools = [{"type": "web_search_20250305", "name": "web_search"}] if settings.get("web_search", True) else []
|
||||
tools = [RETRIEVE_DOCUMENTS_TOOL, SEARCH_FACTS_TOOL, SAVE_DOCUMENT_TOOL]
|
||||
if settings.get("web_search", True):
|
||||
tools.append({"type": "web_search_20250305", "name": "web_search"})
|
||||
|
||||
accumulated_sources = []
|
||||
retrieval_count = 0
|
||||
|
||||
while True:
|
||||
kwargs = {
|
||||
"model": "claude-sonnet-4-6",
|
||||
"max_tokens": 2048,
|
||||
"system": SYSTEM_PROMPT,
|
||||
"messages": messages
|
||||
}
|
||||
if tools:
|
||||
kwargs["tools"] = tools
|
||||
|
||||
response = anthropic_client.messages.create(**kwargs)
|
||||
response = anthropic_client.messages.create(
|
||||
model="claude-sonnet-4-6",
|
||||
max_tokens=2048,
|
||||
system=system_blocks,
|
||||
messages=messages,
|
||||
tools=tools,
|
||||
)
|
||||
|
||||
if response.stop_reason == "tool_use":
|
||||
messages.append({"role": "assistant", "content": response.content})
|
||||
tool_results = []
|
||||
for block in response.content:
|
||||
if block.type == "tool_use":
|
||||
if block.type != "tool_use":
|
||||
continue
|
||||
if block.name == "retrieve_documents":
|
||||
if retrieval_count >= MAX_RETRIEVALS_PER_TURN:
|
||||
result_text = (
|
||||
f"Retrieval budget exhausted "
|
||||
f"({MAX_RETRIEVALS_PER_TURN} calls used this turn). "
|
||||
"Answer with the information you already have or "
|
||||
"tell Aaron you need a more focused question."
|
||||
)
|
||||
else:
|
||||
result_text, result_sources = _execute_retrieve_documents(block.input)
|
||||
accumulated_sources.extend(result_sources)
|
||||
retrieval_count += 1
|
||||
tool_results.append({
|
||||
"type": "tool_result",
|
||||
"tool_use_id": block.id,
|
||||
"content": "Search completed"
|
||||
"content": result_text,
|
||||
})
|
||||
elif block.name == "search_facts":
|
||||
result_text = _execute_search_facts(block.input)
|
||||
tool_results.append({
|
||||
"type": "tool_result",
|
||||
"tool_use_id": block.id,
|
||||
"content": result_text,
|
||||
})
|
||||
elif block.name == "save_document":
|
||||
result_text = _execute_save_document(block.input)
|
||||
tool_results.append({
|
||||
"type": "tool_result",
|
||||
"tool_use_id": block.id,
|
||||
"content": result_text,
|
||||
})
|
||||
else:
|
||||
tool_results.append({
|
||||
"type": "tool_result",
|
||||
"tool_use_id": block.id,
|
||||
"content": "Search completed",
|
||||
})
|
||||
messages.append({"role": "user", "content": tool_results})
|
||||
else:
|
||||
@@ -335,7 +816,18 @@ def chat(user_message, conversation_id, settings, client_time=None):
|
||||
for block in response.content:
|
||||
if hasattr(block, "text"):
|
||||
assistant_message += block.text
|
||||
return assistant_message, list(set(sources))
|
||||
# Async fire-and-forget into Graphiti so the turn lands in the
|
||||
# graph as a single episode for future search_facts queries to
|
||||
# find. Takes ~20 min wall-clock in the background; chat returns
|
||||
# immediately. Disable via SKIP_GRAPHITI_CHAT_PUSH=1 if needed.
|
||||
_push_chat_turn_to_graphiti(conversation_id, user_message, assistant_message)
|
||||
# Cap citations: accumulated_sources can grow large across multiple
|
||||
# retrieve_documents calls and not every chunk that came back was
|
||||
# actually used in the answer. Insertion order preserves rank
|
||||
# (each call returns chunks reranker-ordered, so the earliest
|
||||
# entries are the highest-relevance from the most direct queries).
|
||||
deduped = list(dict.fromkeys(accumulated_sources))
|
||||
return assistant_message, deduped[:MAX_CITED_SOURCES]
|
||||
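The citation cap above relies on `dict.fromkeys` keeping first-seen order, which `list(set(...))` does not guarantee. A minimal sketch of that behaviour follows; `cap_citations` and the sample source names are hypothetical and only illustrate the idiom.

```python
def cap_citations(sources, limit):
    """Drop duplicates while keeping first-seen order, then keep the first `limit`."""
    # dict keys preserve insertion order, so the earliest (highest-ranked)
    # occurrence of each source survives.
    return list(dict.fromkeys(sources))[:limit]


# Illustrative data: two retrieval calls returned overlapping sources.
ranked = ["cv.md", "bio.md", "cv.md", "notes.md", "bio.md"]
capped = cap_citations(ranked, 2)
print(capped)
```

Because each retrieval call returns chunks in reranker order, truncating the deduplicated list keeps the sources that ranked highest in the earliest calls.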
|
||||
from contextlib import asynccontextmanager
|
||||
|
||||
@@ -365,7 +857,7 @@ async def login(request: Request, response: Response):
|
||||
httponly=True,
|
||||
secure=True,
|
||||
samesite="lax",
|
||||
max_age=60 * 60 * 24 * 30
|
||||
max_age=SESSION_MAX_AGE_SECONDS
|
||||
)
|
||||
response.body = b'{"ok": true}'
|
||||
response.status_code = 200
|
||||
@@ -409,7 +901,7 @@ async def update_settings(request: Request, auth: str = Depends(require_auth)):
|
||||
|
||||
@app.get("/api/conversations")
|
||||
async def list_conversations(auth: str = Depends(require_auth)):
|
||||
conn = sqlite3.connect(CONVERSATIONS_DB)
|
||||
conn = _connect_conversations()
|
||||
c = conn.cursor()
|
||||
c.execute('''SELECT id, title, created_at, updated_at, message_count
|
||||
FROM conversations ORDER BY updated_at DESC LIMIT 100''')
|
||||
@@ -429,7 +921,7 @@ async def new_conversation(request: Request, auth: str = Depends(require_auth)):
|
||||
|
||||
@app.get("/api/conversations/{conv_id}/messages")
|
||||
async def get_messages(conv_id: str, auth: str = Depends(require_auth)):
|
||||
conn = sqlite3.connect(CONVERSATIONS_DB)
|
||||
conn = _connect_conversations()
|
||||
c = conn.cursor()
|
||||
c.execute('''SELECT role, content, sources, timestamp FROM messages
|
||||
WHERE conversation_id = ? ORDER BY timestamp ASC''', (conv_id,))
|
||||
@@ -446,7 +938,7 @@ async def rename_conversation(conv_id: str, request: Request, auth: str = Depend
|
||||
title = data.get("title", "")
|
||||
if not title:
|
||||
return JSONResponse({"error": "Title required"}, status_code=400)
|
||||
conn = sqlite3.connect(CONVERSATIONS_DB)
|
||||
conn = _connect_conversations()
|
||||
c = conn.cursor()
|
||||
c.execute("UPDATE conversations SET title = ? WHERE id = ?", (title, conv_id))
|
||||
conn.commit()
|
||||
@@ -455,7 +947,7 @@ async def rename_conversation(conv_id: str, request: Request, auth: str = Depend
|
||||
|
||||
@app.delete("/api/conversations/{conv_id}")
|
||||
async def delete_conversation(conv_id: str, auth: str = Depends(require_auth)):
|
||||
conn = sqlite3.connect(CONVERSATIONS_DB)
|
||||
conn = _connect_conversations()
|
||||
c = conn.cursor()
|
||||
c.execute("DELETE FROM messages WHERE conversation_id = ?", (conv_id,))
|
||||
c.execute("DELETE FROM conversations WHERE id = ?", (conv_id,))
|
||||
@@ -500,14 +992,14 @@ async def chat_endpoint(request: Request, auth: str = Depends(require_auth)):
|
||||
save_message(conversation_id, "user", user_message)
|
||||
|
||||
# Auto-title conversation from first message
|
||||
conn = sqlite3.connect(CONVERSATIONS_DB)
|
||||
conn = _connect_conversations()
|
||||
c = conn.cursor()
|
||||
c.execute("SELECT message_count, title FROM conversations WHERE id = ?", (conversation_id,))
|
||||
row = c.fetchone()
|
||||
conn.close()
|
||||
if row and row[0] <= 1 and row[1] == "New conversation":
|
||||
auto_title = user_message[:60] + ("..." if len(user_message) > 60 else "")
|
||||
conn = sqlite3.connect(CONVERSATIONS_DB)
|
||||
conn = _connect_conversations()
|
||||
c = conn.cursor()
|
||||
c.execute("UPDATE conversations SET title = ? WHERE id = ?", (auto_title, conversation_id))
|
||||
conn.commit()
|
||||
@@ -587,7 +1079,7 @@ async def get_status(auth: str = Depends(require_auth)):
|
||||
pass
|
||||
|
||||
# Conversation count
|
||||
conn = sqlite3.connect(CONVERSATIONS_DB)
|
||||
conn = _connect_conversations()
|
||||
c = conn.cursor()
|
||||
c.execute("SELECT COUNT(*) FROM conversations")
|
||||
conv_count = c.fetchone()[0]
|
||||
@@ -623,6 +1115,7 @@ async def transcribe_audio(request: Request, audio: UploadFile = File(...), auth
|
||||
tmp_path,
|
||||
language="en",
|
||||
vad_filter=True,
|
||||
beam_size=1,
|
||||
initial_prompt=WHISPER_PROMPT
|
||||
)
|
||||
transcript = " ".join(s.text.strip() for s in segments)
|
||||
@@ -669,44 +1162,92 @@ async def run_dreamer(request: Request, auth: str = Depends(require_auth)):
|
||||
return JSONResponse({"started": False, "error": str(e)})
|
||||
|
||||
def transcribe_and_save(tmp_path, timestamp, nextcloud_url, nextcloud_user, nextcloud_password):
|
||||
"""Background task — transcribes audio and saves to Nextcloud after endpoint returns."""
|
||||
"""Background task — transcribes audio and saves to Nextcloud after endpoint returns.
|
||||
Audio is preserved in Journal/Media/ on every terminal path; failed and empty-transcript
|
||||
captures still produce a markdown record in Journal/Captures/ with a status field."""
|
||||
import requests as req_lib
|
||||
nc_auth = (nextcloud_user, nextcloud_password)
|
||||
month_dir = timestamp[:7]
|
||||
audio_ext = os.path.splitext(tmp_path)[1] or ".webm"
|
||||
audio_filename = f"{timestamp}-voice{audio_ext}"
|
||||
audio_relpath = f"Journal/Media/{month_dir}/{audio_filename}"
|
||||
|
||||
def archive_audio() -> bool:
|
||||
try:
|
||||
segments, _ = whisper_model.transcribe(
|
||||
tmp_path, language="en", vad_filter=True, initial_prompt=WHISPER_PROMPT
|
||||
)
|
||||
transcript = " ".join(s.text.strip() for s in segments).strip()
|
||||
os.unlink(tmp_path)
|
||||
if not transcript:
|
||||
print(f"Async transcription empty for {timestamp} — nothing saved")
|
||||
return
|
||||
filename = f"{timestamp}-voice.md"
|
||||
content_md = f"# Capture — {timestamp}\n\n**type:** voice\n**modality:** audio\n**status:** unprocessed\n\n---\n\n{transcript}\n"
|
||||
captures_dir = f"{nextcloud_url}/remote.php/dav/files/{nextcloud_user}/Journal/Captures"
|
||||
req_lib.request("MKCOL", captures_dir, auth=nc_auth, timeout=10)
|
||||
url = f"{captures_dir}/{filename}"
|
||||
req_lib.put(url, data=content_md.encode("utf-8"), auth=nc_auth, timeout=30)
|
||||
print(f"Async transcription saved: {filename}")
|
||||
# Notify SSE clients that transcription is complete
|
||||
try:
|
||||
import requests as _req
|
||||
_req.post("http://localhost:8000/api/events/notify", json={
|
||||
"type": "capture_saved",
|
||||
"filename": filename,
|
||||
"timestamp": timestamp,
|
||||
}, timeout=3)
|
||||
_req.post("http://localhost:8000/api/captures/events/notify", json={
|
||||
"type": "capture_saved",
|
||||
"filename": filename,
|
||||
"timestamp": timestamp,
|
||||
}, timeout=3)
|
||||
except Exception:
|
||||
pass
|
||||
with open(tmp_path, "rb") as f:
|
||||
audio_bytes = f.read()
|
||||
media_parent = f"{nextcloud_url}/remote.php/dav/files/{nextcloud_user}/Journal/Media"
|
||||
media_dir = f"{media_parent}/{month_dir}"
|
||||
req_lib.request("MKCOL", media_parent, auth=nc_auth, timeout=10)
|
||||
req_lib.request("MKCOL", media_dir, auth=nc_auth, timeout=10)
|
||||
req_lib.put(f"{media_dir}/{audio_filename}", data=audio_bytes, auth=nc_auth, timeout=60)
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"Audio archival failed for {timestamp}: {e}")
|
||||
return False
|
||||
finally:
|
||||
if os.path.exists(tmp_path):
|
||||
os.unlink(tmp_path)
|
||||
print(f"Async transcription failed for {timestamp}: {e}")
|
||||
|
||||
def write_capture(filename: str, content_md: str, status: str):
|
||||
captures_dir = f"{nextcloud_url}/remote.php/dav/files/{nextcloud_user}/Journal/Captures"
|
||||
try:
|
||||
req_lib.request("MKCOL", captures_dir, auth=nc_auth, timeout=10)
|
||||
req_lib.put(f"{captures_dir}/{filename}", data=content_md.encode("utf-8"), auth=nc_auth, timeout=30)
|
||||
except Exception as e:
|
||||
print(f"Capture markdown write failed for {timestamp}: {e}")
|
||||
return
|
||||
try:
|
||||
payload = {"type": "capture_saved", "filename": filename, "timestamp": timestamp, "status": status}
|
||||
req_lib.post("http://localhost:8000/api/events/notify", json=payload, timeout=3)
|
||||
req_lib.post("http://localhost:8000/api/captures/events/notify", json=payload, timeout=3)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
transcript = ""
|
||||
transcribe_error = None
|
||||
try:
|
||||
segments, _ = whisper_model.transcribe(
|
||||
tmp_path, language="en", vad_filter=True, beam_size=1, initial_prompt=WHISPER_PROMPT
|
||||
)
|
||||
transcript = " ".join(s.text.strip() for s in segments).strip()
|
||||
except Exception as e:
|
||||
transcribe_error = str(e)
|
||||
|
||||
audio_archived = archive_audio()
|
||||
audio_line = f"**audio_path:** {audio_relpath}\n" if audio_archived else "**audio_archive_failed:** true\n"
|
||||
|
||||
if transcribe_error is not None:
|
||||
filename = f"{timestamp}-voice-failed.md"
|
||||
content_md = (
|
||||
f"# Capture — {timestamp}\n\n"
|
||||
f"**type:** voice\n**modality:** audio\n**status:** failed_transcription\n"
|
||||
f"{audio_line}"
|
||||
f"**error:** {transcribe_error}\n"
|
||||
)
|
||||
write_capture(filename, content_md, "failed_transcription")
|
||||
print(f"Async transcription failed for {timestamp}: {transcribe_error}")
|
||||
return
|
||||
|
||||
if not transcript:
|
||||
filename = f"{timestamp}-voice-empty.md"
|
||||
content_md = (
|
||||
f"# Capture — {timestamp}\n\n"
|
||||
f"**type:** voice\n**modality:** audio\n**status:** empty_transcript\n"
|
||||
f"{audio_line}"
|
||||
)
|
||||
write_capture(filename, content_md, "empty_transcript")
|
||||
print(f"Async transcription empty for {timestamp}: audio archived")
|
||||
return
|
||||
|
||||
filename = f"{timestamp}-voice.md"
|
||||
content_md = (
|
||||
f"# Capture — {timestamp}\n\n"
|
||||
f"**type:** voice\n**modality:** audio\n**status:** saved\n"
|
||||
f"{audio_line}\n---\n\n{transcript}\n"
|
||||
)
|
||||
write_capture(filename, content_md, "saved")
|
||||
print(f"Async transcription saved: {filename}")
|
||||
|
||||
|
||||
@app.post("/api/capture")
|
||||
@@ -760,7 +1301,7 @@ async def capture_endpoint(
|
||||
tmp.write(audio_bytes)
|
||||
tmp_audio_path = tmp.name
|
||||
segments, _ = whisper_model.transcribe(
|
||||
tmp_audio_path, language="en", vad_filter=True, initial_prompt=WHISPER_PROMPT
|
||||
tmp_audio_path, language="en", vad_filter=True, beam_size=1, initial_prompt=WHISPER_PROMPT
|
||||
)
|
||||
voice_annotation = " ".join(s.text.strip() for s in segments).strip() or None
|
||||
os.unlink(tmp_audio_path)
|
||||
@@ -813,7 +1354,7 @@ Keep the full description to 150-250 words. Do not speculate beyond what is visi
|
||||
|
||||
**type:** {capture_type}
|
||||
**modality:** {modality}
|
||||
**status:** unprocessed
|
||||
**status:** saved
|
||||
**media:** {media_path}
|
||||
{f"**project:** {project}" if project else ""}
|
||||
|
||||
@@ -969,7 +1510,7 @@ async def reindex_status(auth: str = Depends(require_auth)):
|
||||
|
||||
@app.delete("/api/conversations")
|
||||
async def clear_all_conversations(auth: str = Depends(require_auth)):
|
||||
conn = sqlite3.connect(CONVERSATIONS_DB)
|
||||
conn = _connect_conversations()
|
||||
c = conn.cursor()
|
||||
c.execute("DELETE FROM messages")
|
||||
c.execute("DELETE FROM conversations")
|
||||
|
||||
@@ -0,0 +1,128 @@
|
||||
"""One-off: backfill last_consolidated_at + consolidation_count on embeddings
from the dream-manifest-*.json files already in Journal/Dreams/.

Why this exists: the consolidation cursor columns added by the dreamer
redesign migration default to NULL / 0. Without history, the
underprocessed-count signal in dream_observation.observe_corpus() reports
"every chunk is underprocessed" (degenerate percentile), and NREM has no
basis to bias replay toward least-recently-consolidated chunks.

We have ~25 historical dream manifests in Nextcloud/Journal/Dreams/, each
listing the sources retrieved per stage. For each (manifest, source) pair
this script:
  - finds matching embeddings rows by source (basename match)
  - increments consolidation_count by 1
  - updates last_consolidated_at to the manifest date (UTC midnight)

Idempotent: re-running will not double-count because we drop existing
cursor values to NULL/0 before backfilling. Pass --dry-run to print what
would change without writing.
"""

import json
import os
import sys
from datetime import datetime, timezone
from pathlib import Path

from dotenv import load_dotenv
import psycopg2

load_dotenv(Path.home() / "aaronai" / ".env", override=True)

PG_DSN = os.getenv("PG_DSN")
DREAMS_DIR = Path("/home/aaron/nextcloud/data/data/aaron/files/Journal/Dreams")
DRY_RUN = "--dry-run" in sys.argv


def get_pg():
    return psycopg2.connect(PG_DSN)


def collect_manifest_records():
    """Return a list of (source_basename, manifest_date_utc) tuples from all
    dream-manifest-*.json files. One pair per (manifest, source) appearance."""
    pairs = []
    if not DREAMS_DIR.exists():
        return pairs
    for path in sorted(DREAMS_DIR.glob("dream-manifest-*.json")):
        try:
            m = json.loads(path.read_text())
        except Exception as e:
            print(f" skip {path.name}: {e}")
            continue
        date_str = m.get("date")
        if not date_str:
            continue
        try:
            dt = datetime.fromisoformat(date_str).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
        stages = m.get("stages") or {}
        for stage_name in ("nrem", "early_rem", "late_rem", "synthesis"):
            stage = stages.get(stage_name) or {}
            for src in (stage.get("sources") or []):
                if src:
                    pairs.append((src, dt))
    return pairs


def main():
    print(f"Mode: {'DRY-RUN' if DRY_RUN else 'APPLY'}")
    print(f"Scanning manifests in {DREAMS_DIR}")
    pairs = collect_manifest_records()
    print(f"Collected {len(pairs)} (source, manifest_date) pairs across all manifests")
    if not pairs:
        print("Nothing to backfill.")
        return

    # Aggregate per source: count + latest date
    from collections import defaultdict
    counts = defaultdict(int)
    latest = {}
    for src, dt in pairs:
        counts[src] += 1
        if src not in latest or dt > latest[src]:
            latest[src] = dt
    print(f"Unique sources to update: {len(counts)}")

    # Sample what we'd write
    print("Sample (top 5 by appearance count):")
    for src, n in sorted(counts.items(), key=lambda kv: -kv[1])[:5]:
        print(f" {n:>3} appearances — {src} → last_consolidated_at = {latest[src].date()}")

    if DRY_RUN:
        print("\nDry-run only. Re-run without --dry-run to apply.")
        return

    pg = get_pg()
    cur = pg.cursor()

    # Reset cursor for any sources we're about to backfill so reruns are clean.
    print("\nResetting cursor for sources we'll touch...")
    sources = list(counts.keys())
    cur.execute(
        "UPDATE embeddings SET last_consolidated_at = NULL, consolidation_count = 0 "
        "WHERE source = ANY(%s)",
        (sources,),
    )
    print(f" reset {cur.rowcount} embeddings rows")

    # Apply per-source updates. For each source, set count and latest date.
    print("Applying per-source backfill...")
    updated_rows = 0
    for src, n in counts.items():
        cur.execute(
            "UPDATE embeddings "
            "SET consolidation_count = %s, last_consolidated_at = %s "
            "WHERE source = %s",
            (n, latest[src], src),
        )
        updated_rows += cur.rowcount
    pg.commit()
    pg.close()
    print(f"Done. Updated {updated_rows} embeddings rows across {len(counts)} unique sources.")


if __name__ == "__main__":
    main()
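The backfill is idempotent because it overwrites each source's cursor with an aggregate (appearance count and latest manifest date) rather than incrementing in place. A minimal in-memory sketch of that property follows, assuming a plain dict in place of the `embeddings` table; `aggregate`, `backfill`, `day`, and the sample rows are hypothetical names and data introduced only for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def aggregate(pairs):
    """Collapse (source, date) appearances into a count and a latest date per source."""
    counts, latest = Counter(), {}
    for src, dt in pairs:
        counts[src] += 1
        if src not in latest or dt > latest[src]:
            latest[src] = dt
    return counts, latest


def backfill(rows, pairs):
    """Overwrite cursor fields from the aggregate; `rows` is a hypothetical in-memory table."""
    counts, latest = aggregate(pairs)
    for src, n in counts.items():
        if src in rows:  # sources with no matching row are skipped
            rows[src] = {"consolidation_count": n, "last_consolidated_at": latest[src]}
    return rows


def day(d):
    """Illustrative helper: UTC midnight on a day in May 2026."""
    return datetime(2026, 5, d, tzinfo=timezone.utc)


pairs = [("a.md", day(1)), ("b.md", day(1)), ("a.md", day(3))]
rows = {"a.md": {}, "b.md": {}, "c.md": {}}
once = backfill(dict(rows), pairs)
twice = backfill(dict(once), pairs)
print(once == twice)
print(once["a.md"]["consolidation_count"], once["a.md"]["last_consolidated_at"].date())
```

Running the backfill a second time produces the same rows, and a source that never appears in a manifest (`c.md` here) is left untouched.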
+1
-1
@@ -6,7 +6,7 @@ mkdir -p "$BACKUP_DIR"
|
||||
# Copy critical files
|
||||
cp ~/aaronai/memory.md "$BACKUP_DIR/memory-$DATE.md"
|
||||
cp ~/aaronai/settings.json "$BACKUP_DIR/settings-$DATE.json"
|
||||
cp ~/aaronai/conversations.db "$BACKUP_DIR/conversations-$DATE.db"
|
||||
python3 -c "import sqlite3, sys; src = sqlite3.connect('$HOME/aaronai/conversations.db'); dst = sqlite3.connect('$BACKUP_DIR/conversations-$DATE.db'); src.backup(dst); dst.close(); src.close()"
|
||||
|
||||
# Keep only last 7 days
|
||||
find "$BACKUP_DIR" -name "*.md" -mtime +7 -delete
|
||||
|
||||
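The hunk above swaps a plain `cp` of `conversations.db` for SQLite's online backup API, presumably because copying a database file while the API process may be writing to it can capture a mid-transaction state. A minimal sketch of the same call in a standalone script follows; `backup_sqlite`, the temporary paths, and the sample table are assumptions for illustration, not the project's backup script.

```python
import os
import sqlite3
import tempfile


def backup_sqlite(src_path: str, dst_path: str) -> None:
    """Copy a SQLite database through the online backup API (a consistent snapshot)."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        src.backup(dst)  # sqlite3.Connection.backup, available since Python 3.7
    finally:
        dst.close()
        src.close()


# Illustrative round trip against a throwaway database.
workdir = tempfile.mkdtemp()
live = os.path.join(workdir, "conversations.db")
copy = os.path.join(workdir, "conversations-backup.db")

conn = sqlite3.connect(live)
conn.execute("CREATE TABLE conversations (id TEXT, title TEXT)")
conn.execute("INSERT INTO conversations VALUES ('c1', 'New conversation')")
conn.commit()
conn.close()

backup_sqlite(live, copy)
check = sqlite3.connect(copy)
restored = check.execute("SELECT id, title FROM conversations").fetchall()
check.close()
print(restored)
```

Closing both connections in a `finally` block mirrors the one-liner's `dst.close(); src.close()` while still releasing the handles if the backup raises.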
+387
-81
@@ -16,11 +16,14 @@ import os
|
||||
import json
|
||||
import sqlite3
|
||||
import argparse
|
||||
from functools import lru_cache
|
||||
from collections import Counter
|
||||
from pathlib import Path
|
||||
from datetime import datetime, timedelta
|
||||
from dotenv import load_dotenv
|
||||
import psycopg2
|
||||
import hashlib
|
||||
import numpy as np
|
||||
|
||||
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
|
||||
|
||||
@@ -40,6 +43,26 @@ NEXTCLOUD_USER = os.getenv("NEXTCLOUD_USER", "aaron")
|
||||
NEXTCLOUD_PASSWORD = os.getenv("NEXTCLOUD_PASSWORD", "")
|
||||
DREAMS_WEBDAV = f"{NEXTCLOUD_URL}/remote.php/dav/files/{NEXTCLOUD_USER}/Journal/Dreams"
|
||||
|
||||
# ─── Retrieval-window config (per dreamer-multimodal-design.md §2) ─────────
|
||||
# Biological grounding: NREM replays recent traces (24-72 hrs); REM links
|
||||
# across time on structural similarity, not temporal proximity. Synthesis
|
||||
# pulls from salience across the full corpus (no window). Spec calls for
|
||||
# these to be mutable rather than hardcoded — this is the mutable home.
|
||||
TIME_WINDOWS_HOURS = {
|
||||
"nrem": 72, # 24-72 hrs, take wider end
|
||||
"early-rem": 24 * 30, # 30 days
|
||||
"late-rem": 24 * 90, # 90 days
|
||||
"lucid": None, # no window
|
||||
}
|
||||
|
||||
# Maximal Marginal Relevance: λ=1 → pure relevance, λ=0 → pure diversity.
|
||||
# 0.5 is the standard balance; tune later if the dossier-cluster problem
|
||||
# isn't sufficiently broken up.
|
||||
MMR_LAMBDA = 0.5
|
||||
|
||||
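The role of `MMR_LAMBDA` is easiest to see in a small greedy selection loop. The following is a minimal sketch of Maximal Marginal Relevance over approximately unit-length vectors, assuming dot product as the similarity; `mmr_select`, `dot`, and the toy vectors are hypothetical and are not taken from `dream.py`, whose actual implementation may differ.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mmr_select(query_vec, doc_vecs, k, lam=0.5):
    """Greedy MMR: returns indices of up to k documents.

    score(i) = lam * sim(query, i) - (1 - lam) * max sim(i, already selected)
    """
    relevance = [dot(query_vec, d) for d in doc_vecs]
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy is the closest match among documents picked so far.
            redundancy = max((dot(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Illustrative data: documents 0 and 1 are near-duplicates, document 2 is distinct.
query = [1.0, 0.0, 0.0]
docs = [
    [0.90, 0.436, 0.0],
    [0.89, 0.456, 0.0],
    [0.70, 0.0, 0.714],
]
balanced = mmr_select(query, docs, k=2, lam=0.5)
relevance_only = mmr_select(query, docs, k=2, lam=1.0)
print(balanced, relevance_only)
```

With `lam=1.0` the two near-duplicates are both chosen; at `lam=0.5` the second pick shifts to the less relevant but distinct document, which is the clustering behaviour the comment above describes wanting to break up.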
# Fast/cheap model for query generation. Sonnet for synthesis (in synthesize_*).
|
||||
LLM_QUERY_MODEL = os.getenv("DREAMER_QUERY_MODEL", "claude-haiku-4-5-20251001")
|
||||
|
||||
# Similarity ranges calibrated for all-MiniLM-L6-v2
|
||||
MODE_RANGES = {
|
||||
"nrem": (0.48, 0.72),
|
||||
@@ -282,68 +305,298 @@ def retrieve_graphiti(mode, task=None, n_results=8, excluded_sources=None):
|
||||
print(f"[Graphiti retrieval error: {e}] — falling back to empty.")
|
||||
return []
|
||||
|
||||
def retrieve(mode, task=None, n_results=8, excluded_sources=None):
|
||||
# E3 experiment: DREAMER_SUBSTRATE=graphiti routes retrieval to Graphiti /search
|
||||
# Default behavior: pgvector similarity search (unchanged)
|
||||
substrate = os.getenv("DREAMER_SUBSTRATE", "pgvector")
|
||||
if substrate == "graphiti":
|
||||
return retrieve_graphiti(mode, task=task, n_results=n_results, excluded_sources=excluded_sources)
|
||||
@lru_cache(maxsize=1)
|
||||
def _get_embedder():
|
||||
from sentence_transformers import SentenceTransformer
|
||||
embedder = SentenceTransformer("all-MiniLM-L6-v2")
|
||||
low, high = MODE_RANGES[mode]
|
||||
return SentenceTransformer("all-MiniLM-L6-v2")
|
||||
|
||||
def _llm_generate_queries(mode, signal, task=None, n_queries=4):
|
||||
"""Park et al. 2023 reflection-style query generation. Feeds the LLM the
|
||||
observation signal + a mode-specific framing; emits N retrieval queries
|
||||
that probe different corners of the recent corpus instead of the same
|
||||
hardcoded string every night. Sources cited in dream_observation.py.
|
||||
|
||||
Falls back to recent_questions from the signal if the LLM call fails."""
|
||||
import anthropic
|
||||
|
||||
if task:
|
||||
query = task
|
||||
elif mode == "late-rem":
|
||||
delta = observe_corpus()
|
||||
topics = delta.get("recent_topics", [])
|
||||
query = topics[0] if topics else "practice place memory making"
|
||||
elif mode == "early-rem":
|
||||
query = "career decision personal change what matters next"
|
||||
# Lucid mode: decompose the user's task into sub-queries
|
||||
prompt = (
|
||||
f"Decompose this user task into {n_queries} distinct sub-questions, "
|
||||
f"each suitable as a retrieval query against Aaron's personal corpus.\n\n"
|
||||
f"TASK: {task}\n\n"
|
||||
f'Output JSON ONLY: {{"queries": ["...", "...", ...]}}'
|
||||
)
|
||||
else:
|
||||
query = "research fabrication teaching practice recent work"
|
||||
mode_framings = {
|
||||
"nrem": (
|
||||
"NREM is replay-and-consolidation of RECENT traces. Generate queries "
|
||||
"that probe what Aaron has been working on or capturing in the last "
|
||||
"few days. Concrete entities — project names, course codes, named "
|
||||
"subjects. The dreamer is re-touching specific recent material to "
|
||||
"strengthen schema connections, not finding novel content."
|
||||
),
|
||||
"early-rem": (
|
||||
"Early REM is associative bridging with emotional/personal register. "
|
||||
"Generate queries that surface unresolved themes, career questions, "
|
||||
"ongoing personal threads — material that connects intellectual and "
|
||||
"emotional dimensions. Tone: thoughtful friend, not researcher."
|
||||
),
|
||||
"late-rem": (
|
||||
"Late REM tests novel connections across DISTANT material. Generate "
|
||||
"queries that pair concrete subjects from DIFFERENT domains of Aaron's "
|
||||
"work (e.g., one from academic teaching, one from consulting, one from "
|
||||
"creative practice) to probe for surprising structural similarity. "
|
||||
"Cross-domain is required."
|
||||
),
|
||||
}
|
||||
framing = mode_framings.get(mode, mode_framings["nrem"])
|
||||
questions_snippet = "\n".join(
|
||||
f" - {q[:200]}" for q in signal.get("recent_questions", [])[:8]
|
||||
) or " (no recent user questions)"
|
||||
journal_snippet = ", ".join(signal.get("new_journal_entries", [])[:5]) or "(none)"
|
||||
days_str = (
|
||||
f"{signal['days_since_dream']:.1f}"
|
||||
if signal.get("days_since_dream") not in (None, float("inf"))
|
||||
else "infinite (first dream)"
|
||||
)
|
||||
prompt = (
|
||||
f"You generate retrieval queries for an Active Inference dreamer. The "
|
||||
f"dreamer surfaces prediction errors — gaps between Aaron's model and "
|
||||
f"reality — not summaries or generic associations.\n\n"
|
||||
f"MODE: {mode}\n"
|
||||
f"FRAMING: {framing}\n\n"
|
||||
f"OBSERVATION SIGNAL:\n"
|
||||
f"- Days since last dream: {days_str}\n"
|
||||
f"- New chunks since last dream: {signal.get('new_chunks', 0)}\n"
|
||||
f"- New journal entries: {journal_snippet}\n"
|
||||
f"- Underprocessed chunks pool: {signal.get('underprocessed_count', 0):,}\n\n"
|
||||
f"RECENT USER QUESTIONS (last 14 days, top 8):\n{questions_snippet}\n\n"
|
||||
f"Generate {n_queries} retrieval queries. Requirements:\n"
|
||||
f"- Use concrete entities, named projects, course codes, specific topics "
|
||||
f"— NOT generic phrasing like 'research work practice'\n"
|
||||
f"- Each query probes a DIFFERENT corner of recent activity\n"
|
||||
f"- Match the {mode} framing\n"
|
||||
f"- 5-15 words each\n\n"
|
||||
f'Output JSON ONLY: {{"queries": ["...", "...", ...]}}'
|
||||
)
|
||||
|
||||
embedding = embedder.encode([query]).tolist()[0]
|
||||
chunks = []
|
||||
seen_sources = set()
|
||||
try:
|
||||
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
|
||||
resp = client.messages.create(
|
||||
model=LLM_QUERY_MODEL,
|
||||
max_tokens=512,
|
||||
messages=[{"role": "user", "content": prompt}],
|
||||
)
|
||||
text = "".join(b.text for b in resp.content if hasattr(b, "text")).strip()
|
||||
if text.startswith("```"):
|
||||
text = text.split("```", 2)[1]
|
||||
if text.startswith("json"):
|
||||
text = text[4:]
|
||||
text = text.strip()
|
||||
data = json.loads(text)
|
||||
queries = data.get("queries", [])
|
||||
if isinstance(queries, list) and queries:
|
||||
return [str(q).strip() for q in queries[:n_queries] if str(q).strip()]
|
||||
except Exception as e:
|
||||
print(f"[dream] LLM query generation failed ({e}); falling back to recent questions")
|
||||
|
||||
fallback = signal.get("recent_questions", [])[:n_queries] if signal else []
|
||||
return fallback or [task or "recent activity decisions thinking"]
|
||||
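The reply-parsing steps in `_llm_generate_queries` (strip an optional code fence, drop a leading `json` tag, parse, truncate, discard blanks) can be tried in isolation. The following is a minimal sketch under assumptions: `parse_query_reply` and `FENCE` are hypothetical names that do not appear in the diff, the sample reply is invented, and the sketch returns an empty list on failure where the diff falls back to recent questions.

```python
import json

# Built indirectly so the example can itself live inside a fenced block.
FENCE = "`" * 3


def parse_query_reply(text, n_queries=4):
    """Extract up to n_queries non-empty query strings from a JSON reply.

    Hypothetical helper that mirrors the fence-stripping steps in the diff.
    """
    text = text.strip()
    if text.startswith(FENCE):
        # Keep only what sits between the first pair of fences.
        text = text.split(FENCE, 2)[1]
        # Drop a leading language tag such as "json".
        if text.startswith("json"):
            text = text[4:]
    text = text.strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    queries = data.get("queries", []) if isinstance(data, dict) else []
    if not isinstance(queries, list):
        return []
    # Truncate first, then discard blanks, in the same order as the diff.
    cleaned = [str(q).strip() for q in queries[:n_queries]]
    return [q for q in cleaned if q]


# Invented sample reply: fenced JSON with one blank entry.
fenced = (
    FENCE
    + 'json\n{"queries": ["studio review notes", "  ", "router jig tolerances"]}\n'
    + FENCE
)
print(parse_query_reply(fenced))
```

Because truncation happens before blank entries are dropped, a reply containing blanks can yield fewer than `n_queries` results.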
|
||||
|
||||
def _mmr_select(candidate_embeddings, query_embedding, n, lambda_=MMR_LAMBDA):
|
||||
"""Maximal Marginal Relevance — greedy selection that balances relevance
|
||||
against pairwise diversity. Carbonell & Goldstein 1998. Used to prevent
|
||||
cluster lock-in (e.g., 8 dossier-narrative variants filling all 8 slots).
|
||||
|
||||
candidate_embeddings: (N, D) numpy array
|
||||
query_embedding: (D,) numpy array
|
||||
Returns: list of indices into candidate_embeddings, len ≤ n."""
|
||||
if len(candidate_embeddings) == 0:
|
||||
return []
|
||||
n = min(n, len(candidate_embeddings))
|
||||
cands = candidate_embeddings / (np.linalg.norm(candidate_embeddings, axis=1, keepdims=True) + 1e-9)
|
||||
q = query_embedding / (np.linalg.norm(query_embedding) + 1e-9)
|
||||
relevance = cands @ q
|
||||
selected = []
|
||||
remaining = list(range(len(cands)))
|
||||
while len(selected) < n and remaining:
|
||||
if not selected:
|
||||
best = max(remaining, key=lambda i: relevance[i])
|
||||
else:
|
||||
sel = cands[selected]
|
||||
scores = {
|
||||
i: lambda_ * relevance[i] - (1 - lambda_) * float((cands[i] @ sel.T).max())
|
||||
for i in remaining
|
||||
}
|
||||
best = max(scores, key=scores.get)
|
||||
selected.append(best)
|
||||
remaining.remove(best)
|
||||
return selected
|
||||
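To see the cluster lock-in that `_mmr_select` is meant to prevent, the same greedy rule can be run on toy vectors. This is a sketch, not the code from the diff: it uses plain lists instead of numpy, `mmr_select`, `_unit` and `_dot` are hypothetical names, and `lambda_=0.5` is an assumed value because `MMR_LAMBDA` is not shown in this hunk.

```python
import math


def _unit(v):
    """Scale a vector to unit length (the epsilon avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in v)) + 1e-9
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mmr_select(candidates, query, n, lambda_=0.5):
    """Greedy MMR. Each step picks the candidate that maximises
    lambda_ * sim(candidate, query) - (1 - lambda_) * max sim(candidate, picked).
    """
    cands = [_unit(c) for c in candidates]
    q = _unit(query)
    relevance = [_dot(c, q) for c in cands]
    selected, remaining = [], list(range(len(cands)))
    while remaining and len(selected) < n:
        def score(i):
            if not selected:
                return relevance[i]  # first pick: relevance only
            redundancy = max(_dot(cands[i], cands[j]) for j in selected)
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: indices 0 and 1 are near-duplicates; index 2 points elsewhere.
candidates = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
query = [1.0, 0.1]

by_relevance = mmr_select(candidates, query, n=2, lambda_=1.0)
balanced = mmr_select(candidates, query, n=2, lambda_=0.5)
print(by_relevance, balanced)
```

With `lambda_=1.0` the rule reduces to pure relevance ranking and both near-duplicates are picked; with `lambda_=0.5` the second slot goes to the distant vector.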
|
||||
|
||||
def _bump_consolidation_cursor(chunks):
|
||||
"""Increment consolidation_count + set last_consolidated_at=NOW() for each
|
||||
source represented in chunks. Called from dream_pipeline after NREM
|
||||
completes. Per sharp-wave-ripples biology, NREM does the actual
|
||||
consolidation; REM is associative use, so we only bump on NREM."""
|
||||
if not chunks:
|
||||
return
|
||||
sources = list({c["source"] for c in chunks if c.get("source")})
|
||||
if not sources:
|
||||
return
|
||||
try:
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
excluded_sources = excluded_sources or set()
|
||||
if excluded_sources:
|
||||
cur.execute("""
|
||||
SELECT document, source, 1 - (embedding <=> %s::vector) as similarity
|
||||
FROM embeddings
|
||||
WHERE source NOT IN %s
|
||||
ORDER BY embedding <=> %s::vector
|
||||
LIMIT %s
|
||||
""", (embedding, tuple(excluded_sources), embedding, n_results * 3))
|
||||
else:
|
||||
cur.execute("""
|
||||
SELECT document, source, 1 - (embedding <=> %s::vector) as similarity
|
||||
FROM embeddings
|
||||
ORDER BY embedding <=> %s::vector
|
||||
LIMIT %s
|
||||
""", (embedding, embedding, n_results * 3))
|
||||
|
||||
for doc, source, similarity in cur.fetchall():
|
||||
if not (low <= similarity <= high):
|
||||
continue
|
||||
if source in seen_sources:
|
||||
continue
|
||||
chunks.append({
|
||||
"source": source or "unknown",
|
||||
"content": doc,
|
||||
"relevance": similarity,
|
||||
"similarity": similarity,
|
||||
})
|
||||
seen_sources.add(source)
|
||||
if len(chunks) >= n_results:
|
||||
break
|
||||
cur.execute(
|
||||
"UPDATE embeddings "
|
||||
"SET consolidation_count = consolidation_count + 1, "
|
||||
" last_consolidated_at = NOW() "
|
||||
"WHERE source = ANY(%s)",
|
||||
(sources,),
|
||||
)
|
||||
pg.commit()
|
||||
pg.close()
|
||||
except Exception as e:
|
||||
print(f"pgvector retrieval error: {e}")
|
||||
print(f"[dream] cursor bump failed (non-fatal): {e}")
|
||||
|
||||
|
||||
def retrieve(mode, task=None, n_results=8, excluded_sources=None,
|
||||
type_filter=None, signal=None):
|
||||
"""Refactored retrieval — see dreamer-design-spec.md Stage 3 + the
|
||||
external-literature prescription in birdai-dreamer-exclusion-finding-2026-05-02.md.
|
||||
|
||||
Changes from the prior hardcoded-query version:
|
||||
- Queries are LLM-generated from the observation signal (Park et al.
|
||||
reflection pattern) instead of fixed strings. Solves the "same 8 sources
|
||||
every night" failure where fixed seeds locked into one neighborhood.
|
||||
- Per-mode time windows (24-72hr NREM / 30d Early REM / 90d Late REM)
|
||||
filter candidates before vector search. Spec calls for these to be
|
||||
mutable; they live in TIME_WINDOWS_HOURS.
|
||||
- NREM biases toward under-processed chunks (low consolidation_count).
|
||||
Biologically motivated: sharp-wave ripples tag what to replay, not
|
||||
uniform sampling.
|
||||
- Multiple queries (4 by default) → over-fetch → MMR merge for
|
||||
within-night diversity. Prevents cluster domination.
|
||||
|
||||
signal is the observation-signal dict from dream_observation.observe_corpus().
|
||||
If None, observe_corpus is called inline (back-compat for ad-hoc invocation).
|
||||
"""
|
||||
# E3 substrate experiment unchanged
|
||||
substrate = os.getenv("DREAMER_SUBSTRATE", "pgvector")
|
||||
if substrate == "graphiti":
|
||||
return retrieve_graphiti(mode, task=task, n_results=n_results,
|
||||
excluded_sources=excluded_sources)
|
||||
|
||||
if signal is None:
|
||||
from dream_observation import observe_corpus as _obs
|
||||
signal = _obs()
|
||||
|
||||
queries = _llm_generate_queries(mode, signal, task=task, n_queries=4)
|
||||
if not queries:
|
||||
print(f"[dream:{mode}] no queries generated; bailing")
|
||||
return []
|
||||
print(f"[dream:{mode}] generated queries: {queries}")
|
||||
|
||||
embedder = _get_embedder()
|
||||
excluded_sources = excluded_sources or set()
|
||||
window_hours = TIME_WINDOWS_HOURS.get(mode)
|
||||
per_query_n = 12 # over-fetch for MMR
|
||||
|
||||
candidates = []
|
||||
seen_ids = set()
|
||||
try:
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
for q in queries:
|
||||
q_emb = embedder.encode([q]).tolist()[0]
|
||||
where, params = [], []
|
||||
if excluded_sources:
|
||||
where.append("source NOT IN %s")
|
||||
params.append(tuple(excluded_sources))
|
||||
if type_filter:
|
||||
where.append("type = ANY(%s)")
|
||||
params.append(list(type_filter))
|
||||
if window_hours is not None:
|
||||
# created_at is TEXT (legacy); cast it. NULL created_at fails
|
||||
# the comparison so legacy rows are excluded from windowed
|
||||
# modes — correct: NULL means "indexed before cursor existed,"
|
||||
# which by definition is older than any window.
|
||||
where.append(
|
||||
f"(created_at IS NOT NULL AND "
|
||||
f"created_at::timestamptz > NOW() - INTERVAL '{int(window_hours)} hours')"
|
||||
)
|
||||
where_clause = ("WHERE " + " AND ".join(where)) if where else ""
|
||||
# NREM bias: order by consolidation_count ASC first (under-processed
|
||||
# chunks sort first; vector distance only breaks ties). Other modes:
|
||||
# vector distance only.
|
||||
order_clause = (
|
||||
"ORDER BY consolidation_count ASC, embedding <=> %s::vector"
|
||||
if mode == "nrem"
|
||||
else "ORDER BY embedding <=> %s::vector"
|
||||
)
|
||||
cur.execute(f"""
|
||||
SELECT id, document, source, type, embedding,
|
||||
1 - (embedding <=> %s::vector) as similarity
|
||||
FROM embeddings
|
||||
{where_clause}
|
||||
{order_clause}
|
||||
LIMIT %s
|
||||
""", [q_emb, *params, q_emb, per_query_n])
|
||||
for row in cur.fetchall():
|
||||
if row[0] in seen_ids:
|
||||
continue
|
||||
seen_ids.add(row[0])
|
||||
emb = row[4]
|
||||
# pgvector returns embeddings as string "[...]" by default
|
||||
if isinstance(emb, str):
|
||||
emb = np.array([float(x) for x in emb.strip("[]").split(",")])
|
||||
else:
|
||||
emb = np.array(emb)
|
||||
candidates.append({
|
||||
"id": row[0],
|
||||
"content": row[1],
|
||||
"source": row[2] or "unknown",
|
||||
"type": row[3],
|
||||
"embedding": emb,
|
||||
"similarity": float(row[5]),
|
||||
})
|
||||
pg.close()
|
||||
except Exception as e:
|
||||
import traceback
|
||||
print(f"[dream:{mode}] retrieval SQL error: {e}")
|
||||
traceback.print_exc()
|
||||
return []
|
||||
|
||||
if not candidates:
|
||||
print(f"[dream:{mode}] zero candidates after filters")
|
||||
return []
|
||||
|
||||
# MMR over the union, using the first query as pivot for the relevance term.
|
||||
# Averaging query embeddings would be theoretically cleaner but adds
|
||||
# complexity for marginal benefit at this scale.
|
||||
pivot_emb = np.array(embedder.encode([queries[0]]).tolist()[0])
|
||||
cand_embs = np.array([c["embedding"] for c in candidates])
|
||||
selected_idx = _mmr_select(cand_embs, pivot_emb, n=n_results * 2)
|
||||
|
||||
# Post-MMR source-level dedup (multi-chunk same source collapses to one).
|
||||
chunks = []
|
||||
seen_sources = set()
|
||||
for i in selected_idx:
|
||||
c = candidates[i]
|
||||
if c["source"] in seen_sources:
|
||||
continue
|
||||
seen_sources.add(c["source"])
|
||||
chunks.append({
|
||||
"source": c["source"],
|
||||
"content": c["content"],
|
||||
"relevance": c["similarity"],
|
||||
"similarity": c["similarity"],
|
||||
"type": c["type"],
|
||||
})
|
||||
if len(chunks) >= n_results:
|
||||
break
|
||||
|
||||
return chunks
|
||||
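The refactored `retrieve` builds its SQL from optional filters, so the placeholder order matters: one vector for the SELECT list, then the WHERE values, then one vector for ORDER BY, then the LIMIT. The sketch below rebuilds only the string and parameter list so the ordering can be checked without a database. `build_candidate_query` is a hypothetical helper, the `"<query-embedding>"` string stands in for a real vector, and sorting the excluded sources is added here purely to make the output deterministic.

```python
def build_candidate_query(mode, excluded_sources=None, type_filter=None,
                          window_hours=None, limit=12):
    """Return (sql, params) for one query embedding, mirroring the diff."""
    q_emb = "<query-embedding>"  # stand-in for the encoded query vector
    where, params = [], []
    if excluded_sources:
        where.append("source NOT IN %s")
        params.append(tuple(sorted(excluded_sources)))
    if type_filter:
        where.append("type = ANY(%s)")
        params.append(list(type_filter))
    if window_hours is not None:
        # int() keeps the interpolated interval strictly numeric.
        where.append(
            "(created_at IS NOT NULL AND created_at::timestamptz > "
            f"NOW() - INTERVAL '{int(window_hours)} hours')"
        )
    where_clause = ("WHERE " + " AND ".join(where)) if where else ""
    order_clause = (
        "ORDER BY consolidation_count ASC, embedding <=> %s::vector"
        if mode == "nrem"
        else "ORDER BY embedding <=> %s::vector"
    )
    sql = (
        "SELECT id, document, source, type, embedding, "
        "1 - (embedding <=> %s::vector) AS similarity "
        f"FROM embeddings {where_clause} {order_clause} LIMIT %s"
    )
    # Placeholder order: SELECT vector, WHERE values, ORDER BY vector, LIMIT.
    return sql, [q_emb, *params, q_emb, limit]


sql, params = build_candidate_query(
    "nrem",
    excluded_sources={"b.md", "a.md"},
    type_filter=["document"],
    window_hours=72,
)
print(sql)
print(params)
```

A quick sanity check for any change to this builder is that the number of `%s` placeholders equals the length of the parameter list.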
|
||||
@@ -476,38 +729,71 @@ def write_manifest(date_str, stage_data, corpus_data):
|
||||
auth = (NEXTCLOUD_USER, NEXTCLOUD_PASSWORD)
|
||||
url = f"{DREAMS_WEBDAV}/dream-manifest-{date_str}.json"
|
||||
try:
|
||||
requests.put(url, data=content.encode("utf-8"), auth=auth, timeout=30)
|
||||
response = requests.put(url, data=content.encode("utf-8"), auth=auth, timeout=30)
|
||||
response.raise_for_status()
|
||||
print(f"Manifest written: Journal/Dreams/dream-manifest-{date_str}.json")
|
||||
except Exception as e:
|
||||
print(f"Manifest write failed (non-critical): {e}")
|
||||
print(f"Manifest write failed — manifest not persisted: {e}")
|
||||
|
||||
|
||||
def dream_pipeline():
|
||||
def dream_pipeline(type_filter=None):
|
||||
"""
|
||||
Full nightly pipeline — interdependent stages.
|
||||
NREM output feeds Early REM. Both feed Late REM. All three feed Synthesis.
|
||||
|
||||
Per dreamer-design-spec.md, this now runs Stage 1 (observe) and Stage 2
|
||||
(select) first. The `selected is None` branch below is retained, but
|
||||
select_mode in dream_observation.py is documented as never returning None,
|
||||
so in practice NREM/Early-REM/Late-REM run with LLM-generated queries seeded
|
||||
from the observation signal.
|
||||
"""
|
||||
print(f"Dreamer pipeline starting — {datetime.now().strftime('%Y-%m-%d %H:%M')}")
|
||||
|
||||
state = load_dreamer_state()
|
||||
previously_retrieved = set(state.get("retrieved_sources", []))
|
||||
state.pop("retrieved_sources", None) # legacy key; session-scoped novelty now
|
||||
session_retrieved = set()
|
||||
|
||||
delta = observe_corpus()
|
||||
print(f"Corpus: {delta['new_chunks']} new chunks, {delta['days_since_dream']:.1f} days since last dream")
|
||||
print(f"Excluding {len(previously_retrieved)} previously retrieved sources")
|
||||
# ── Stage 1 + 2: Observe + Select ──────────────────────────────────────
|
||||
from dream_observation import observe_corpus as _obs, select_mode as _select
|
||||
signal = _obs()
|
||||
print(
|
||||
f"Signal: new_chunks={signal['new_chunks']}, "
|
||||
f"new_journal={len(signal['new_journal_entries'])}, "
|
||||
f"days_since={signal['days_since_dream']:.1f}, "
|
||||
f"underprocessed={signal['underprocessed_count']:,}"
|
||||
)
|
||||
selected = _select(signal)
|
||||
if selected is None:
|
||||
print("[select_mode] None — nothing worth dreaming about tonight (going quiet)")
|
||||
# Record last_select_quiet_at but leave last_dream_timestamp alone, so a caller can distinguish
|
||||
# an actual dream from a skipped night by looking at last_dream_file or
|
||||
# checking the manifest dir.
|
||||
state["last_select_quiet_at"] = datetime.now().isoformat()
|
||||
save_dreamer_state(state)
|
||||
return None
|
||||
print(f"[select_mode] → {selected}")
|
||||
|
||||
# ── Stage 1: NREM ──────────────────────────────────────────────────────
|
||||
# The pipeline always runs all three modes for the manifest's continuity.
|
||||
# select_mode's choice signals the *primary* focus; the others still run
|
||||
# but draw from their own mode-appropriate windows.
|
||||
primary_mode = selected
|
||||
|
||||
# ── Stage 3: NREM ──────────────────────────────────────────────────────
|
||||
print("\n[NREM] Retrieving...")
|
||||
# NREM is replay-and-consolidation — does not exclude prior traces.
|
||||
# Late REM and Early REM exclude prior content for novelty; NREM does not.
|
||||
nrem_chunks = retrieve("nrem", excluded_sources=None)
|
||||
nrem_chunks = retrieve("nrem", excluded_sources=None,
|
||||
type_filter=type_filter, signal=signal)
|
||||
session_retrieved.update(c["source"] for c in nrem_chunks)
|
||||
# Track sources that scored above Early REM ceiling — these are the only ones Early REM should exclude
|
||||
nrem_high_sources = {c["source"] for c in nrem_chunks if c["similarity"] > 0.55}
|
||||
if not nrem_chunks:
|
||||
print("[NREM] No suitable chunks — aborting pipeline")
|
||||
return None
|
||||
# Cursor bump: NREM is the consolidation stage. Each appearance increments
|
||||
# consolidation_count + updates last_consolidated_at, so the next dream's
|
||||
# observation sees these sources as less under-processed.
|
||||
_bump_consolidation_cursor(nrem_chunks)
|
||||
|
||||
print(f"[NREM] Retrieved {len(nrem_chunks)} chunks. Synthesizing...")
|
||||
nrem_output = synthesize_nrem(nrem_chunks)
|
||||
@@ -518,11 +804,15 @@ def dream_pipeline():
|
||||
"nrem": {
|
||||
"chunks_retrieved": len(nrem_chunks),
|
||||
"avg_similarity": round(sum(c["relevance"] for c in nrem_chunks) / len(nrem_chunks), 3),
|
||||
"query": "research fabrication teaching practice recent work",
|
||||
"query": "[llm-generated from observation signal]",
|
||||
"word_count": len(nrem_output.split()),
|
||||
"sources": nrem_sources,
|
||||
"distinct_folders": nrem_folders,
|
||||
"folder_count": len(nrem_folders),
|
||||
# Counter filters None: Graphiti chunks lack `type` (facts, not embeddings rows).
|
||||
# Pgvector chunks always carry type post-Improvement-#2 backfill. If type
|
||||
# ever appears as None here, the backfill or writer enforcement has regressed.
|
||||
"type_distribution": dict(Counter(c.get("type") for c in nrem_chunks if c.get("type"))),
|
||||
"status": "ok",
|
||||
}
|
||||
}
|
||||
@@ -532,7 +822,8 @@ def dream_pipeline():
|
||||
print("\n[Early REM] Retrieving...")
|
||||
# Early REM excludes previously retrieved + NREM high-scorers only (not full session_retrieved)
|
||||
# Sources that scored in Early REM band during NREM remain available
|
||||
early_chunks = retrieve("early-rem", excluded_sources=previously_retrieved | nrem_high_sources)
|
||||
early_chunks = retrieve("early-rem", excluded_sources=nrem_high_sources,
|
||||
type_filter=type_filter, signal=signal)
|
||||
session_retrieved.update(c["source"] for c in early_chunks)
|
||||
if not early_chunks:
|
||||
print("[Early REM] No suitable chunks — skipping")
|
||||
@@ -546,18 +837,20 @@ def dream_pipeline():
|
||||
stage_data["early_rem"] = {
|
||||
"chunks_retrieved": len(early_chunks),
|
||||
"avg_similarity": round(sum(c["relevance"] for c in early_chunks) / len(early_chunks), 3),
|
||||
"query": "career decision personal change what matters next",
|
||||
"query": "[llm-generated from observation signal]",
|
||||
"word_count": len(early_rem_output.split()),
|
||||
"sources": early_sources,
|
||||
"distinct_folders": early_folders,
|
||||
"folder_count": len(early_folders),
|
||||
"type_distribution": dict(Counter(c.get("type") for c in early_chunks if c.get("type"))),
|
||||
"status": "ok",
|
||||
}
|
||||
print(f"[Early REM] Done.\n{early_rem_output[:200]}...")
|
||||
|
||||
# ── Stage 3: Late REM — informed by NREM + Early REM ──────────────────
|
||||
print("\n[Late REM] Retrieving...")
|
||||
late_chunks = retrieve("late-rem", excluded_sources=previously_retrieved | session_retrieved)
|
||||
late_chunks = retrieve("late-rem", excluded_sources=session_retrieved,
|
||||
type_filter=type_filter, signal=signal)
|
||||
session_retrieved.update(c["source"] for c in late_chunks)
|
||||
if not late_chunks:
|
||||
print("[Late REM] No suitable chunks — skipping")
|
||||
@@ -576,12 +869,13 @@ def dream_pipeline():
|
||||
stage_data["late_rem"] = {
|
||||
"chunks_retrieved": len(late_chunks),
|
||||
"avg_similarity": round(sum(c["relevance"] for c in late_chunks) / len(late_chunks), 3),
|
||||
"query": "practice place memory making",
|
||||
"query": "[llm-generated from observation signal]",
|
||||
"word_count": len(late_rem_output.split()),
|
||||
"sources": late_sources,
|
||||
"distinct_folders": list(set(late_folders)),
|
||||
"folder_count": len(set(late_folders)),
|
||||
"cross_domain_pairs": cross_domain_pairs,
|
||||
"type_distribution": dict(Counter(c.get("type") for c in late_chunks if c.get("type"))),
|
||||
"status": "ok",
|
||||
}
|
||||
print(f"[Late REM] Done.\n{late_rem_output[:200]}...")
|
||||
@@ -603,8 +897,20 @@ def dream_pipeline():
|
||||
# Write manifest
|
||||
all_session_sources = list(session_retrieved)
|
||||
all_session_folders = list({extract_folder(s) for s in all_session_sources})
|
||||
total_chunks = 0
|
||||
pg = None
|
||||
try:
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
cur.execute("SELECT COUNT(*) FROM embeddings")
|
||||
total_chunks = cur.fetchone()[0]
|
||||
except Exception as e:
|
||||
print(f"total_chunks query failed (non-critical): {e}")
|
||||
finally:
|
||||
if pg is not None:
|
||||
pg.close()
|
||||
corpus_data = {
|
||||
"total_chunks": delta.get("new_chunks", 0),
|
||||
"total_chunks": total_chunks,
|
||||
"new_chunks_since_last_dream": delta.get("new_chunks", 0),
|
||||
"days_since_last_dream": round(delta.get("days_since_dream", 0), 2),
|
||||
"substrate": "pgvector",
|
||||
@@ -616,18 +922,11 @@ def dream_pipeline():
|
||||
}
|
||||
write_manifest(datetime.now().strftime("%Y-%m-%d"), stage_data, corpus_data)
|
||||
|
||||
# Update state and notify
|
||||
state = load_dreamer_state()
|
||||
# Update state and notify (reuse state from start of pipeline; legacy key already popped)
|
||||
state["last_dream_timestamp"] = datetime.now().timestamp()
|
||||
state["last_dream_mode"] = "pipeline"
|
||||
state["last_dream_file"] = synthesis_file
|
||||
|
||||
# Accumulate retrieved sources across nights. Cap at 500, trim to 400 on overflow.
|
||||
all_retrieved = list(previously_retrieved | session_retrieved)
|
||||
if len(all_retrieved) > 500:
|
||||
all_retrieved = all_retrieved[-400:]
|
||||
state["retrieved_sources"] = all_retrieved
|
||||
|
||||
save_dreamer_state(state)
|
||||
|
||||
notify_sse("synthesis", synthesis_file.split("/")[-1])
|
||||
@@ -635,10 +934,10 @@ def dream_pipeline():
|
||||
return synthesis_file
|
||||
|
||||
|
||||
def dream_lucid(task):
|
||||
def dream_lucid(task, type_filter=None):
|
||||
"""On-demand lucid dream — single mode, used by Dream Now in settings."""
|
||||
print(f"Lucid dream starting — task: {task[:80] if task else 'none'}")
|
||||
chunks = retrieve("lucid", task=task)
|
||||
chunks = retrieve("lucid", task=task, type_filter=type_filter)
|
||||
if not chunks:
|
||||
print("No suitable chunks — aborting")
|
||||
return None
|
||||
@@ -660,13 +959,13 @@ def dream_lucid(task):
|
||||
return filepath
|
||||
|
||||
|
||||
def dream_single(mode, task=None):
|
||||
def dream_single(mode, task=None, type_filter=None):
|
||||
"""
|
||||
Single mode — used by Dream Now for non-lucid modes.
|
||||
Runs one stage independently (for testing/tuning individual stages).
|
||||
"""
|
||||
print(f"Single mode dream: {mode}")
|
||||
chunks = retrieve(mode, task=task)
|
||||
chunks = retrieve(mode, task=task, type_filter=type_filter)
|
||||
if not chunks:
|
||||
print("No suitable chunks — aborting")
|
||||
return None
|
||||
@@ -703,12 +1002,19 @@ if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(description="Aaron AI Dreamer")
|
||||
parser.add_argument("--mode", choices=["nrem", "early-rem", "late-rem", "lucid", "pipeline"])
|
||||
parser.add_argument("--task", type=str)
|
||||
parser.add_argument(
|
||||
"--type-filter", type=str, default=None,
|
||||
help="Comma-separated embeddings.type allowlist (e.g. 'document,aaronai_conversation'). "
|
||||
"Applies to pgvector retrieval only; Graphiti chunks are not filtered. "
|
||||
"Experimental — default is no filter, no behavior change.",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
type_filter = [t.strip() for t in args.type_filter.split(",")] if args.type_filter else None
|
||||
|
||||
if args.mode == "lucid":
|
||||
dream_lucid(args.task or "What should I be thinking about that I am not?")
|
||||
dream_lucid(args.task or "What should I be thinking about that I am not?", type_filter=type_filter)
|
||||
elif args.mode and args.mode != "pipeline":
|
||||
dream_single(args.mode, args.task)
|
||||
dream_single(args.mode, args.task, type_filter=type_filter)
|
||||
else:
|
||||
# Default: full pipeline
|
||||
dream_pipeline()
|
||||
dream_pipeline(type_filter=type_filter)
|
||||
|
||||
@@ -0,0 +1,235 @@
|
||||
"""
|
||||
Dreamer Stages 1 + 2 — Observe and Select.
|
||||
|
||||
Implements `dreamer-design-spec.md`'s Stage 1 (observe_corpus) and Stage 2
|
||||
(select_mode). These have been latent in dream.py — observe_corpus existed
|
||||
in skeletal form but its output was largely unused; select_mode did not
|
||||
exist at all. The dreamer always ran all stages with hardcoded queries.
|
||||
|
||||
Per spec (lines 27–34 of dreamer-design-spec.md):
|
||||
delta = observe_corpus()
|
||||
selected_mode = select_mode(delta, task, project)
|
||||
if selected_mode is None:
|
||||
return # nothing worth dreaming
|
||||
|
||||
The "returns None — dreamer goes quiet rather than manufacturing novelty"
|
||||
semantics (spec line 67) is the canonical answer to the repetition problem
|
||||
documented in birdai-dreamer-exclusion-finding-2026-05-02.md.
|
||||
|
||||
Grounded in:
|
||||
- Active Inference (Friston 2010, 2017) — observe error, choose action that
|
||||
minimizes free energy. The dreamer is a prediction-error machine; observe
|
||||
what's diverged from the model, dream about that.
|
||||
- Sleep stages (Stickgold 2005; Walker 2017; Diekelmann & Born 2010) — NREM
|
||||
for replay of new traces, REM for associative cross-cluster integration.
|
||||
- Sharp-wave ripples (Buzsáki, Wilson) — biology tags WHAT to replay
|
||||
(under-processed chunks); not uniform. Implemented via the consolidation
|
||||
cursor on the embeddings table.
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import sqlite3
|
||||
from datetime import datetime, timedelta
|
||||
from pathlib import Path
|
||||
|
||||
from dotenv import load_dotenv
|
||||
import psycopg2
|
||||
|
||||
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
|
||||
|
||||
# ─── Paths ──────────────────────────────────────────────────────────────────
|
||||
|
||||
PG_DSN = os.getenv("PG_DSN")
|
||||
CONVERSATIONS_DB = str(Path.home() / "aaronai" / "conversations.db")
|
||||
WATCHER_STATE = str(Path.home() / "aaronai" / "watcher_state.json")
|
||||
DREAMER_STATE = str(Path.home() / "aaronai" / "dreamer_state.json")
|
||||
JOURNAL_DAILY = "/home/aaron/nextcloud/data/data/aaron/files/Journal/Daily"
|
||||
|
||||
# ─── Thresholds ─────────────────────────────────────────────────────────────
|
||||
# Per spec, these become settings-panel controls eventually. For now they're
|
||||
# constants here; moving them to a config module is task #48.
|
||||
|
||||
NEW_CHUNK_THRESHOLD = 5 # below this, NREM not warranted on novelty alone
|
||||
STALENESS_TRIGGER_DAYS = 3 # corpus quiet ≥3 days → Late REM ("shake things loose")
|
||||
QUESTION_LOOKBACK_DAYS = 14 # spec line 61: "the last 14 days"
|
||||
UNDERPROCESSED_PERCENTILE = 0.25 # bottom quartile of consolidation_count
|
||||
|
||||
|
||||
# ─── Helpers ────────────────────────────────────────────────────────────────
|
||||
|
||||
def _get_pg():
|
||||
return psycopg2.connect(PG_DSN)
|
||||
|
||||
|
||||
def _load_json(path, default):
|
||||
try:
|
||||
return json.loads(Path(path).read_text())
|
||||
except Exception:
|
||||
return default
|
||||
|
||||
|
||||
def _recent_user_questions(days=QUESTION_LOOKBACK_DAYS, limit=20):
|
||||
"""Pull recent user-turn content from conversations.db. The spec calls
|
||||
these 'live questions' — what Aaron has been asking about. They become
|
||||
seed material for the REM modes."""
|
||||
try:
|
||||
conn = sqlite3.connect(CONVERSATIONS_DB)
|
||||
cutoff = (datetime.now() - timedelta(days=days)).isoformat()
|
||||
cur = conn.cursor()
|
||||
cur.execute(
|
||||
"""
|
||||
SELECT m.content FROM messages m
|
||||
JOIN conversations c ON m.conversation_id = c.id
|
||||
WHERE m.role = 'user' AND c.updated_at > ?
|
||||
ORDER BY m.timestamp DESC LIMIT ?
|
||||
""",
|
||||
(cutoff, limit),
|
||||
)
|
||||
rows = cur.fetchall()
|
||||
conn.close()
|
||||
return [r[0][:280] for r in rows]
|
||||
except Exception:
|
||||
return []
|
||||
|
||||
|
||||
def _new_journal_entries(since_ts):
|
||||
"""Files in Journal/Daily/ created or modified since the last dream.
|
||||
Journal entries with emotional/personal register route to Early REM per
|
||||
the spec (line 71)."""
|
||||
journal_path = Path(JOURNAL_DAILY)
|
||||
if not journal_path.exists():
|
||||
return []
|
||||
new = []
|
||||
for p in journal_path.rglob("*.md"):
|
||||
try:
|
||||
if p.stat().st_mtime > since_ts:
|
||||
new.append(str(p.relative_to(journal_path)))
|
||||
except OSError:
|
||||
continue
|
||||
return new
|
||||
|
||||
|
||||
def _new_chunks_count(since_ts):
|
||||
"""Files in the watcher state with mtime > last_dream. The spec calls
|
||||
this 'what changed' (line 58). Used as the NREM novelty signal."""
|
||||
state = _load_json(WATCHER_STATE, {})
|
||||
count = 0
|
||||
for _path, mtime in state.items():
|
||||
try:
|
||||
if float(mtime) > since_ts:
|
||||
count += 1
|
||||
except (ValueError, TypeError):
|
||||
continue
|
||||
return count
|
||||
|
||||
|
||||
def _underprocessed_chunk_count():
|
||||
"""Chunks below the underprocessed percentile by consolidation_count.
|
||||
Biologically motivated: sharp-wave ripples bias replay toward novel /
|
||||
under-encoded experience, not uniform sampling. We give NREM a pool of
|
||||
'least-replayed' chunks to draw from in Stage 3."""
|
||||
try:
|
||||
pg = _get_pg()
|
||||
cur = pg.cursor()
|
||||
cur.execute(
|
||||
"""
|
||||
WITH t AS (
|
||||
SELECT percentile_cont(%s) WITHIN GROUP (ORDER BY consolidation_count)
|
||||
AS threshold
|
||||
FROM embeddings
|
||||
)
|
||||
SELECT COUNT(*) FROM embeddings, t
|
||||
WHERE consolidation_count <= t.threshold
|
||||
""",
|
||||
(UNDERPROCESSED_PERCENTILE,),
|
||||
)
|
||||
result = cur.fetchone()[0]
|
||||
pg.close()
|
||||
return int(result or 0)
|
||||
except Exception:
|
||||
return 0
|
||||
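The threshold query counts every chunk at or below the 25th-percentile `consolidation_count`, which is not always a quarter of the corpus: when many chunks share the lowest count, all of them qualify. The sketch below illustrates this in plain Python. It assumes `percentile_cont` interpolates linearly at position `fraction * (n - 1)` over the sorted values, and the sample counts are invented.

```python
import math


def percentile_cont(values, fraction):
    """Linearly interpolated percentile over the sorted values.

    Assumption: this matches the interpolation used by SQL percentile_cont.
    """
    ordered = sorted(values)
    if not ordered:
        return None
    position = fraction * (len(ordered) - 1)
    lower, upper = math.floor(position), math.ceil(position)
    if lower == upper:
        return float(ordered[lower])
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


def underprocessed_count(counts, fraction=0.25):
    """Count the values at or below the percentile threshold."""
    threshold = percentile_cont(counts, fraction)
    if threshold is None:
        return 0
    return sum(1 for c in counts if c <= threshold)


# Invented consolidation_count values: mostly never-consolidated chunks.
mostly_zero = [0, 0, 0, 0, 0, 0, 1, 2, 3, 5]
# Invented values with no ties.
all_distinct = [0, 1, 2, 3, 4, 5, 6, 7]

print(underprocessed_count(mostly_zero), "of", len(mostly_zero))
print(underprocessed_count(all_distinct), "of", len(all_distinct))
```

On a corpus where most chunks have never been consolidated, the reported pool can therefore cover well over a quarter of the table.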
|
||||
|
||||
# ─── Stage 1: observe_corpus ────────────────────────────────────────────────
|
||||
|
||||
def observe_corpus():
|
||||
"""Build the signal vector consumed by select_mode and (downstream) by
|
||||
retrieve. Concrete observations only — no interpretation. Each key is
|
||||
a direct measurement from the corpus, watcher, journal, or conversation
|
||||
log.
|
||||
|
||||
Returns a dict with:
|
||||
now_ts -- current Unix timestamp
|
||||
last_dream_ts -- last completed dream timestamp (0 if never)
|
||||
days_since_dream -- float; inf if never dreamed
|
||||
new_chunks -- count of files newer than last_dream
|
||||
new_journal_entries -- list of Journal/Daily/*.md filenames since last_dream
|
||||
recent_questions -- user-turn content from last 14 days
|
||||
underprocessed_count -- chunks at or below the 25th-percentile consolidation_count
|
||||
"""
|
||||
state = _load_json(DREAMER_STATE, {})
|
||||
last_dream_ts = float(state.get("last_dream_timestamp", 0) or 0)
|
||||
now_ts = datetime.now().timestamp()
|
||||
|
||||
return {
|
||||
"now_ts": now_ts,
|
||||
"last_dream_ts": last_dream_ts,
|
||||
"days_since_dream": (now_ts - last_dream_ts) / 86400 if last_dream_ts else float("inf"),
|
||||
"new_chunks": _new_chunks_count(last_dream_ts),
|
||||
"new_journal_entries": _new_journal_entries(last_dream_ts),
|
||||
"recent_questions": _recent_user_questions(),
|
||||
"underprocessed_count": _underprocessed_chunk_count(),
|
||||
}
|
||||
|
||||
|
||||
# ─── Stage 2: select_mode ───────────────────────────────────────────────────
|
||||
|
||||
def select_mode(signal, task=None, explicit_mode=None):
    """Return one of {'nrem', 'early-rem', 'late-rem', 'lucid'}. Never None.

    The dreamer fires every scheduled night. The earlier "go quiet on null
    delta" rule was a synthesis-doc invention that didn't match the actual
    desired UX — the original dreamer always dreamed, even if it repeated
    itself. The cure for repetition lives in the retrieve layer
    (LLM-generated queries from the observation signal, MMR diversity,
    cursor bias toward under-processed chunks), not in skipping nights.

    Routing logic:
      - explicit_mode argument wins
      - task supplied → 'lucid' (question-anchored)
      - days_since_dream ≥ STALENESS_TRIGGER_DAYS → 'late-rem' (shake loose
        via cross-domain pairs when nothing's been added in a while)
      - new journal entry → 'early-rem' (emotional/personal register)
      - default → 'nrem' (replay-and-consolidation; always has something to
        do because the corpus always has under-processed chunks)
    """
    if explicit_mode:
        return explicit_mode
    if task:
        return "lucid"

    days_since = signal["days_since_dream"]
    new_journal = signal["new_journal_entries"]

    if days_since >= STALENESS_TRIGGER_DAYS:
        return "late-rem"

    if new_journal:
        return "early-rem"

    return "nrem"
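The precedence order in the docstring can be checked against a few hand-built signals. The sketch below restates the routing in self-contained form; `STALENESS_TRIGGER_DAYS = 7` is a placeholder chosen for illustration, not the value the module actually uses, and the journal filename is made up.

```python
STALENESS_TRIGGER_DAYS = 7  # placeholder; the module's real value is not shown


def select_mode(signal, task=None, explicit_mode=None):
    """Same precedence as the routing above: explicit > task > stale > journal > nrem."""
    if explicit_mode:
        return explicit_mode
    if task:
        return "lucid"
    if signal["days_since_dream"] >= STALENESS_TRIGGER_DAYS:
        return "late-rem"
    if signal["new_journal_entries"]:
        return "early-rem"
    return "nrem"


fresh = {"days_since_dream": 1.0, "new_journal_entries": []}
journaled = {"days_since_dream": 1.0, "new_journal_entries": ["2026-05-03.md"]}
# A corpus that has never dreamed reports days_since_dream = inf.
never = {"days_since_dream": float("inf"), "new_journal_entries": ["2026-05-03.md"]}

for name, sig in [("fresh", fresh), ("journaled", journaled), ("never", never)]:
    print(name, "->", select_mode(sig))
print("task ->", select_mode(fresh, task="What changed this week?"))
print("explicit ->", select_mode(never, task="q", explicit_mode="nrem"))
```

Two consequences follow from the ordering. A first-ever run routes to `'late-rem'` even when journal entries exist, because staleness is checked before the journal. And `explicit_mode` is returned as given, so a caller passing an unknown string would get that string back unless validation happens elsewhere.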
|
||||
|
||||
|
||||
# ─── CLI for manual inspection ──────────────────────────────────────────────
|
||||
|
||||
if __name__ == "__main__":
    signal = observe_corpus()
    short = {k: v for k, v in signal.items() if k != "recent_questions"}
    print("Signal (excluding recent_questions):")
    print(json.dumps(short, indent=2, default=str))
    print(f"\nRecent user questions ({len(signal['recent_questions'])}):")
    for q in signal["recent_questions"][:5]:
        print(f"  - {q[:140]}")
    mode = select_mode(signal)
    print(f"\nselect_mode() → {mode!r}")
|
||||
+240
-29
@@ -1,17 +1,20 @@
|
||||
"""
|
||||
Aaron AI Stage 1 encoding helpers — single canonical implementation of:
|
||||
- extract_text(filepath) — four-extension text extraction
|
||||
- chunk_text(text, chunk_size, overlap) — word-based chunking
|
||||
- chunk_and_embed(text, source, embedder, filepath, folder) — produce ready-to-write rows
|
||||
- extract_blocks(filepath) — section-aware extraction (docx heading-bounded
|
||||
sections, pptx per-slide, pdf/txt/md single-block)
|
||||
- extract_text(filepath) — back-compat string concatenation over blocks
|
||||
- chunk_text(text, chunk_size, overlap) — word-based blind chunking
|
||||
- chunk_and_embed(text_or_blocks, source, embedder, filepath, folder) —
|
||||
produce ready-to-write rows. Accepts str (blind) or list[dict] (section-aware).
|
||||
- write_embeddings_batch(conn, batch) — server-side NOW() canonical INSERT
|
||||
|
||||
Used by watcher.py, ingest.py, corpus_integrity.py, and api.py /api/corpus/retry.
|
||||
Replaces four separate extract reimplementations and two extract-chunk-embed paths.
|
||||
"""
|
||||
|
||||
import hashlib
|
||||
import json
|
||||
import logging
|
||||
import re
|
||||
from pathlib import Path
|
||||
|
||||
from docx import Document as DocxDocument
|
||||
@@ -24,33 +27,187 @@ SUPPORTED = {".docx", ".pdf", ".pptx", ".txt", ".md"}
|
||||
DEFAULT_CHUNK_SIZE = 500
|
||||
DEFAULT_CHUNK_OVERLAP = 50
|
||||
|
||||
_BOLD_KV_RE = re.compile(r"^\*\*[\w +/-]+?:\*\*")
|
||||
|
||||
def extract_text(filepath: Path) -> str:
|
||||
"""Return the text of a supported file. Returns "" on any failure or
|
||||
unsupported extension. Does not write to ingest_failures — caller decides."""
|
||||
|
||||
def _strip_md_frontmatter(text: str) -> str:
|
||||
"""Strip a leading frontmatter block from markdown, if present.
|
||||
|
||||
Recognizes two formats:
|
||||
- YAML-style: file's first non-empty line is `---`, terminated by `---`.
|
||||
Only triggered when no heading precedes — guards against `---`
|
||||
horizontal rules that follow an H1.
|
||||
- Capture-style: optional H1 heading, then one or more `**key:** value`
|
||||
lines (and blanks), terminated by `---`. The H1 is preserved; the
|
||||
key/value block + separator are removed.
|
||||
|
||||
Body `---` rules and body `**bold:**` lines are never touched — the scan
|
||||
aborts as soon as a non-frontmatter line appears in the leading block.
|
||||
"""
|
||||
lines = text.splitlines()
|
||||
n = len(lines)
|
||||
i = 0
|
||||
while i < n and not lines[i].strip():
|
||||
i += 1
|
||||
heading = None
|
||||
if i < n and lines[i].startswith("# "):
|
||||
heading = lines[i]
|
||||
i += 1
|
||||
while i < n and not lines[i].strip():
|
||||
i += 1
|
||||
if i >= n:
|
||||
return text
|
||||
first = lines[i].strip()
|
||||
if heading is None and first == "---":
|
||||
j = i + 1
|
||||
while j < n and lines[j].strip() != "---":
|
||||
j += 1
|
||||
if j >= n:
|
||||
return text
|
||||
body_start = j + 1
|
||||
elif _BOLD_KV_RE.match(first):
|
||||
j = i
|
||||
while j < n:
|
||||
s = lines[j].strip()
|
||||
if not s or _BOLD_KV_RE.match(s):
|
||||
j += 1
|
||||
continue
|
||||
if s == "---":
|
||||
body_start = j + 1
|
||||
break
|
||||
return text
|
||||
else:
|
||||
return text
|
||||
else:
|
||||
return text
|
||||
body = "\n".join(lines[body_start:]).lstrip("\n")
|
||||
return f"{heading}\n\n{body}" if heading else body
|
||||
|
||||
|
||||
def _docx_cell_paragraphs(cell):
|
||||
yield from (p for p in cell.paragraphs if p.text.strip())
|
||||
for nested in cell.tables:
|
||||
for row in nested.rows:
|
||||
for c in row.cells:
|
||||
yield from _docx_cell_paragraphs(c)
|
||||
|
||||
|
||||
def _pptx_shape_text(shape):
|
||||
from pptx.enum.shapes import MSO_SHAPE_TYPE
|
||||
parts = []
|
||||
if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
|
||||
for sub in shape.shapes:
|
||||
parts.extend(_pptx_shape_text(sub))
|
||||
return parts
|
||||
if hasattr(shape, "text") and shape.text.strip():
|
||||
parts.append(shape.text)
|
||||
if getattr(shape, "has_table", False):
|
||||
for cell in shape.table.iter_cells():
|
||||
if cell.text.strip():
|
||||
parts.append(cell.text)
|
||||
return parts
|
||||
|
||||
|
||||
def _extract_docx_blocks(filepath: Path) -> list[dict]:
|
||||
"""Return docx content as a single block. Earlier attempt at section-aware
|
||||
chunking via Heading styles was rolled back: the user's docs are mostly
|
||||
Normal-styled with bold-as-heading, and tying chunk boundaries to formatting
|
||||
choices locks future-them into preserving those choices forever. Lexical
|
||||
+ cross-encoder retrieval already finds the right substrings within a
|
||||
blind-chunked CV, so the section structure isn't load-bearing for retrieval."""
|
||||
from docx.oxml.ns import qn
|
||||
|
||||
doc = DocxDocument(filepath)
|
||||
parts = [p.text for p in doc.paragraphs if p.text.strip()]
|
||||
for tbl in doc.tables:
|
||||
for row in tbl.rows:
|
||||
for cell in row.cells:
|
||||
parts.extend(p.text for p in _docx_cell_paragraphs(cell))
|
||||
for section in doc.sections:
|
||||
parts.extend(p.text for p in section.header.paragraphs if p.text.strip())
|
||||
parts.extend(p.text for p in section.footer.paragraphs if p.text.strip())
|
||||
for txbx in doc.element.body.findall(".//" + qn("w:txbxContent")):
|
||||
for p in txbx.findall(".//" + qn("w:p")):
|
||||
text = "".join(t.text or "" for t in p.findall(".//" + qn("w:t")))
|
||||
if text.strip():
|
||||
parts.append(text)
|
||||
text = "\n".join(parts)
|
||||
return [{"heading": None, "text": text, "kind": "doc"}] if text.strip() else []
|
||||
|
||||
|
||||
def _extract_pptx_blocks(filepath: Path) -> list[dict]:
|
||||
"""One block per slide. Heading = slide title (or 'Slide N' fallback).
|
||||
Body = non-title shape text + speaker notes."""
|
||||
prs = Presentation(filepath)
|
||||
blocks = []
|
||||
for i, slide in enumerate(prs.slides, 1):
|
||||
title_shape = None
|
||||
try:
|
||||
title_shape = slide.shapes.title
|
||||
except (AttributeError, KeyError):
|
||||
pass
|
||||
title = None
|
||||
body_parts = []
|
||||
for shape in slide.shapes:
|
||||
if title_shape is not None and shape == title_shape and shape.has_text_frame:
|
||||
title = shape.text_frame.text.strip() or None
|
||||
continue
|
||||
body_parts.extend(_pptx_shape_text(shape))
|
||||
if slide.has_notes_slide:
|
||||
notes = slide.notes_slide.notes_text_frame.text
|
||||
if notes.strip():
|
||||
body_parts.append(f"[Notes] {notes}")
|
||||
if title or body_parts:
|
||||
blocks.append({
|
||||
"heading": title or f"Slide {i}",
|
||||
"text": "\n".join(body_parts),
|
||||
"kind": "slide",
|
||||
})
|
||||
return blocks
|
||||
|
||||
|
||||
+def extract_blocks(filepath: Path) -> list[dict]:
+    """Structured extraction. Returns list of {heading, text, kind} blocks.
+
+    - docx: section-aware via Heading-style paragraphs (kind='section').
+    - pptx: one block per slide (kind='slide').
+    - pdf/txt/md: single block, no heading (kind='doc').
+
+    Empty list on any failure or unsupported extension."""
     suffix = filepath.suffix.lower()
     try:
         if suffix == ".docx":
-            doc = DocxDocument(filepath)
-            return "\n".join(p.text for p in doc.paragraphs if p.text.strip())
-        elif suffix == ".pdf":
+            return _extract_docx_blocks(filepath)
+        if suffix == ".pptx":
+            return _extract_pptx_blocks(filepath)
+        if suffix == ".pdf":
             reader = PdfReader(filepath)
-            return "".join(
+            text = "".join(
                 page.extract_text() + "\n"
                 for page in reader.pages if page.extract_text()
             )
-        elif suffix == ".pptx":
-            prs = Presentation(filepath)
-            return "\n".join(
-                shape.text for slide in prs.slides
-                for shape in slide.shapes
-                if hasattr(shape, "text") and shape.text.strip()
-            )
-        elif suffix in {".txt", ".md"}:
-            return filepath.read_text(encoding="utf-8", errors="ignore")
+            return [{"heading": None, "text": text, "kind": "doc"}] if text.strip() else []
+        if suffix in {".txt", ".md"}:
+            text = filepath.read_text(encoding="utf-8", errors="ignore")
+            if suffix == ".md":
+                text = _strip_md_frontmatter(text)
+            return [{"heading": None, "text": text, "kind": "doc"}] if text.strip() else []
     except Exception as e:
-        log.warning(f"Text extraction failed for {filepath.name}: {e}")
-    return ""
+        log.warning(f"Extraction failed for {filepath.name}: {e}")
+    return []
|
||||
|
||||
|
||||
def extract_text(filepath: Path) -> str:
|
||||
"""Back-compat wrapper: concatenate extract_blocks() output. Section
|
||||
structure is lost; use extract_blocks() directly for chunking."""
|
||||
blocks = extract_blocks(filepath)
|
||||
parts = []
|
||||
for b in blocks:
|
||||
if b.get("heading"):
|
||||
parts.append(b["heading"])
|
||||
if b.get("text"):
|
||||
parts.append(b["text"])
|
||||
return "\n".join(parts)
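The back-compat wrapper's flattening is small enough to demonstrate in isolation. This sketch takes a hand-written block list (the slide titles and text are invented) and applies the same join; `blocks_to_text` is an illustrative name for the loop body above, not a function in the module.

```python
def blocks_to_text(blocks):
    """Join headings and bodies with newlines, skipping empty fields."""
    parts = []
    for b in blocks:
        if b.get("heading"):
            parts.append(b["heading"])
        if b.get("text"):
            parts.append(b["text"])
    return "\n".join(parts)


blocks = [
    {"heading": "Roadmap", "text": "Q3 goals\n[Notes] mention budget", "kind": "slide"},
    {"heading": "Slide 2", "text": "", "kind": "slide"},       # title-less, body-less slide
    {"heading": None, "text": "Plain body", "kind": "doc"},    # pdf/txt/md shape
]
print(blocks_to_text(blocks))
```

After flattening, a heading is indistinguishable from a body line and `kind` is gone, which is why the docstring steers chunking callers to `extract_blocks()` instead.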
|
||||
|
||||
|
||||
def chunk_text(text: str,
|
||||
@@ -73,18 +230,49 @@ def _chunk_id(filepath, source: str, index: int) -> str:
|
||||
return f"{hashlib.md5(basis.encode()).hexdigest()[:8]}_{index}"
|
||||
|
||||
|
||||
def chunk_and_embed(text: str,
|
||||
def chunk_and_embed(text_or_blocks,
|
||||
source: str,
|
||||
embedder,
|
||||
filepath=None,
|
||||
folder=None) -> list[dict]:
|
||||
"""Chunk text, embed each chunk, return rows ready for write_embeddings_batch."""
|
||||
chunks = chunk_text(text)
|
||||
"""Chunk + embed for write_embeddings_batch. Accepts either:
|
||||
|
||||
- str: blind chunking with 500-word windows (pdf/txt/md legacy path).
|
||||
- list[dict]: section-aware path (docx Heading-bounded sections, pptx
|
||||
slides). Each block emits one chunk if its text fits within
|
||||
DEFAULT_CHUNK_SIZE words, otherwise is blind-split with overlap.
|
||||
|
||||
The block heading is prepended to the chunk text (so retrieval sees the
|
||||
section context) and stored in metadata as heading/kind."""
|
||||
if isinstance(text_or_blocks, str):
|
||||
blocks = [{"heading": None, "text": text_or_blocks, "kind": "doc"}]
|
||||
else:
|
||||
blocks = text_or_blocks
|
||||
|
||||
chunks = []
|
||||
for block in blocks:
|
||||
body = block.get("text") or ""
|
||||
heading = block.get("heading")
|
||||
kind = block.get("kind", "doc")
|
||||
if not body.strip() and not (heading and heading.strip()):
|
||||
continue
|
||||
if heading and body.strip():
|
||||
contextualized = f"{heading}\n\n{body}"
|
||||
elif heading:
|
||||
contextualized = heading
|
||||
else:
|
||||
contextualized = body
|
||||
if len(contextualized.split()) <= DEFAULT_CHUNK_SIZE:
|
||||
chunks.append((contextualized, heading, kind))
|
||||
else:
|
||||
for sub in chunk_text(contextualized):
|
||||
chunks.append((sub, heading, kind))
|
||||
|
||||
if not chunks:
|
||||
return []
|
||||
embeddings = embedder.encode(chunks).tolist()
|
||||
embeddings = embedder.encode([c[0] for c in chunks]).tolist()
|
||||
rows = []
|
||||
for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
|
||||
for i, ((chunk, heading, kind), emb) in enumerate(zip(chunks, embeddings)):
|
||||
rows.append({
|
||||
"id": _chunk_id(filepath, source, i),
|
||||
"document": chunk,
|
||||
@@ -95,17 +283,37 @@ def chunk_and_embed(text: str,
|
||||
"source": source,
|
||||
"filepath": str(filepath) if filepath else source,
|
||||
"folder": folder,
|
||||
"heading": heading,
|
||||
"kind": kind,
|
||||
},
|
||||
})
|
||||
return rows
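How a block list expands into chunks can be sketched without an embedder. Below, `chunk_words` is a hypothetical stand-in for `chunk_text`, whose body is not part of this diff; it assumes a sliding window over whitespace-split words with a fixed overlap. `plan_chunks` is likewise an illustrative name, and the tiny `chunk_size=5` exists only to make the split visible.

```python
def chunk_words(text, chunk_size, overlap):
    """Assumed word-window chunker: windows of chunk_size words, overlapping by `overlap`."""
    words = text.split()
    step = chunk_size - overlap
    out = []
    for start in range(0, len(words), step):
        out.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return out


def plan_chunks(blocks, chunk_size=500, overlap=50):
    """Return (chunk_text, heading, kind) tuples, one or more per block."""
    chunks = []
    for block in blocks:
        body = block.get("text") or ""
        heading = block.get("heading")
        kind = block.get("kind", "doc")
        if not body.strip() and not (heading and heading.strip()):
            continue  # nothing to embed
        if heading and body.strip():
            contextualized = f"{heading}\n\n{body}"
        elif heading:
            contextualized = heading
        else:
            contextualized = body
        if len(contextualized.split()) <= chunk_size:
            chunks.append((contextualized, heading, kind))
        else:
            for sub in chunk_words(contextualized, chunk_size, overlap):
                chunks.append((sub, heading, kind))
    return chunks


blocks = [
    {"heading": "Intro", "text": "a b", "kind": "slide"},
    {"heading": "H", "text": " ".join(f"w{i}" for i in range(12)), "kind": "doc"},
    {"heading": None, "text": "   ", "kind": "doc"},
]
for text, heading, kind in plan_chunks(blocks, chunk_size=5, overlap=1):
    print(repr(text), heading, kind)
```

Under this assumed chunker, only the first sub-chunk of an oversized block carries the heading in its text. Later sub-chunks keep it only through the `heading` metadata field.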
|
||||
|
||||
|
||||
def write_embeddings_batch(conn, batch: list[dict]) -> int:
|
||||
"""Single canonical INSERT. Sets created_at = NOW() server-side. Commits."""
|
||||
def write_embeddings_batch(conn, batch: list[dict], commit: bool = True) -> int:
|
||||
"""Single canonical INSERT. Sets created_at = NOW() server-side.
|
||||
|
||||
Every row dict must supply 'type'. created_at is SQL-supplied (NOW()), so
|
||||
callers do not need to provide it. The application-layer assertion is the
|
||||
primary enforcement point for type — the column lacks NOT NULL because
|
||||
historical NULLs were resolved by the Improvement #2 backfill, and a
|
||||
Python-level raise gives a faster, more debuggable failure than a
|
||||
Postgres constraint error.
|
||||
|
||||
When commit=True (default), this function commits the connection itself.
|
||||
When commit=False, the caller is responsible for committing. Use
|
||||
commit=False when composing this write with other writes that must land
|
||||
atomically in the same transaction.
|
||||
"""
|
||||
if not batch:
|
||||
return 0
|
||||
cur = conn.cursor()
|
||||
for row in batch:
|
||||
if not row.get("type"):
|
||||
raise ValueError(
|
||||
f"row {row.get('id')!r} missing 'type'; writers must supply it "
|
||||
f"(see Improvement #2 in docs/birdai-component-inventory)"
|
||||
)
|
||||
cur.execute("""
|
||||
INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
|
||||
VALUES (%s, %s, %s::vector, %s, %s, NOW(), %s)
|
||||
@@ -113,8 +321,11 @@ def write_embeddings_batch(conn, batch: list[dict]) -> int:
|
||||
document = EXCLUDED.document,
|
||||
embedding = EXCLUDED.embedding,
|
||||
source = EXCLUDED.source,
|
||||
type = EXCLUDED.type,
|
||||
created_at = COALESCE(embeddings.created_at, EXCLUDED.created_at),
|
||||
metadata = EXCLUDED.metadata
|
||||
""", (row["id"], row["document"], row["embedding"],
|
||||
row["source"], row["type"], json.dumps(row["metadata"])))
|
||||
if commit:
|
||||
conn.commit()
|
||||
return len(batch)
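The `commit` flag and the `COALESCE` on `created_at` can be exercised without a Postgres instance. The sketch below uses an in-memory SQLite database as a stand-in, so the SQL dialect differs: `datetime('now')` replaces `NOW()`, there is no `vector` column, and the table is a reduced, hypothetical version of `embeddings`. `make_db` and `write_batch` are illustrative names, and the upsert syntax assumes SQLite 3.24 or newer.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE embeddings ("
        "id TEXT PRIMARY KEY, document TEXT, type TEXT, created_at TEXT)"
    )
    return conn


def write_batch(conn, batch, commit=True):
    """Upsert rows; require 'type'; leave committing to the caller if commit=False."""
    if not batch:
        return 0
    cur = conn.cursor()
    for row in batch:
        if not row.get("type"):
            raise ValueError(f"row {row.get('id')!r} missing 'type'")
        cur.execute(
            """
            INSERT INTO embeddings (id, document, type, created_at)
            VALUES (?, ?, ?, datetime('now'))
            ON CONFLICT(id) DO UPDATE SET
                document = excluded.document,
                type = excluded.type,
                created_at = COALESCE(embeddings.created_at, excluded.created_at)
            """,
            (row["id"], row["document"], row["type"]),
        )
    if commit:
        conn.commit()
    return len(batch)


conn = make_db()
row = {"id": "a1b2c3d4_0", "document": "first draft", "type": "document"}

# Composed write: nothing is durable until the caller commits.
write_batch(conn, [row], commit=False)
conn.rollback()
print(conn.execute("SELECT COUNT(*) FROM embeddings").fetchone()[0])

# Default path commits; a later upsert refreshes content but keeps created_at.
write_batch(conn, [row])
conn.execute("UPDATE embeddings SET created_at = '2020-01-01 00:00:00'")
conn.commit()
write_batch(conn, [dict(row, document="second draft")])
print(conn.execute("SELECT document, created_at FROM embeddings").fetchone())
```

As in the function above, the `type` check runs per row inside the loop, so rows before the bad one have already been executed when the error is raised. With `commit=False` the caller can roll those back.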
|
||||
|
||||
@@ -0,0 +1,304 @@
|
||||
"""Backfill embeddings.type and embeddings.created_at (Improvement #2 / A.3).
|
||||
|
||||
Idempotent on cohort predicates (every WHERE clause includes IS NULL on the
|
||||
target column). Writes provenance to metadata.type_source and metadata.created_at_source
|
||||
so each row is auditable and revertable per-source. Default --dry-run=True.
|
||||
|
||||
Order of batches:
|
||||
T1. type backfill: WHERE type IS NULL -> 'document' (extension-classified, all hit).
|
||||
C1. created_at: WHERE ca IS NULL AND metadata.filepath stat-resolves -> filesystem mtime.
|
||||
C2. created_at: WHERE ca IS NULL AND source has unique watcher_state path -> watcher mtime.
|
||||
C3. created_at: WHERE ca IS NULL AND source has watcher_state collision -> most-recent mtime.
|
||||
C4. created_at: WHERE type='chatgpt_conversation' AND ca IS NULL -> export-resolved create_time.
|
||||
C5. created_at: WHERE ca IS NULL (residual) -> sentinel.
|
||||
|
||||
Snapshot table embeddings_backup_2026_05_03 must exist before --apply.
|
||||
|
||||
Usage:
|
||||
venv/bin/python3 scripts/experiments/embeddings_backfill_apply.py # dry-run
|
||||
venv/bin/python3 scripts/experiments/embeddings_backfill_apply.py --apply # write
|
||||
|
||||
Exits non-zero if snapshot is missing on --apply.
|
||||
"""
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
from collections import Counter, defaultdict
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
import psycopg2
|
||||
from psycopg2.extras import RealDictCursor, Json
|
||||
from dotenv import load_dotenv
|
||||
|
||||
load_dotenv(Path.home() / "aaronai" / ".env")
|
||||
|
||||
PG_DSN = os.getenv("PG_DSN")
|
||||
WATCHER_STATE = Path.home() / "aaronai" / "watcher_state.json"
|
||||
CHATGPT_EXPORT_DIR = Path("/home/aaron/nextcloud/data/data/aaron/files/Archive/Misc/ChatGPT Export")
|
||||
SNAPSHOT_TABLE = "embeddings_backup_2026_05_03"
|
||||
SENTINEL_ISO = "2026-04-26T00:00:00Z"
|
||||
|
||||
|
||||
# ─── Helpers ────────────────────────────────────────────────────────────────
|
||||
|
||||
def get_pg():
|
||||
return psycopg2.connect(PG_DSN, cursor_factory=RealDictCursor)
|
||||
|
||||
|
||||
def header(t):
|
||||
bar = "=" * 70
|
||||
print(f"\n{bar}\n{t}\n{bar}")
|
||||
|
||||
|
||||
def fmt_ts_unix(ts):
|
||||
return datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat().replace("+00:00", "Z")
|
||||
|
||||
|
||||
def fmt_ts_mtime(p):
|
||||
try:
|
||||
return datetime.fromtimestamp(p.stat().st_mtime, tz=timezone.utc).isoformat().replace("+00:00", "Z")
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
|
||||
def load_watcher_state():
|
||||
state = json.loads(WATCHER_STATE.read_text())
|
||||
by_name = defaultdict(list)
|
||||
for path, mtime in state.items():
|
||||
by_name[Path(path).name].append((path, mtime))
|
||||
return by_name
|
||||
|
||||
|
||||
def load_chatgpt_index():
|
||||
if not CHATGPT_EXPORT_DIR.exists():
|
||||
return {}
|
||||
index = {}
|
||||
for f in sorted(CHATGPT_EXPORT_DIR.glob("conversations*.json")):
|
||||
try:
|
||||
data = json.loads(f.read_text(encoding="utf-8"))
|
||||
except Exception:
|
||||
continue
|
||||
for convo in data:
|
||||
cid = convo.get("id") or convo.get("conversation_id")
|
||||
ct = convo.get("create_time")
|
||||
if cid and ct is not None:
|
||||
index[cid] = ct
|
||||
return index
|
||||
|
||||
|
||||
def assert_snapshot(cur):
|
||||
cur.execute("SELECT to_regclass(%s) AS t;", (SNAPSHOT_TABLE,))
|
||||
if cur.fetchone()["t"] is None:
|
||||
print(f"ERROR: snapshot table '{SNAPSHOT_TABLE}' not found. Run A.2 first.")
|
||||
sys.exit(2)
|
||||
cur.execute(f"SELECT COUNT(*) AS n FROM {SNAPSHOT_TABLE};")
|
||||
snap = cur.fetchone()["n"]
|
||||
cur.execute("SELECT COUNT(*) AS n FROM embeddings;")
|
||||
live = cur.fetchone()["n"]
|
||||
print(f"snapshot {SNAPSHOT_TABLE}: {snap} rows; live embeddings: {live} rows")
|
||||
if snap != live:
|
||||
print(f"ERROR: snapshot row count != live ({snap} vs {live}). Refresh snapshot before --apply.")
|
||||
sys.exit(2)
|
||||
|
||||
|
||||
# ─── Batch primitive ────────────────────────────────────────────────────────
|
||||
|
||||
def run_batch(cur, label, candidates, apply_mode):
|
||||
"""candidates: list of (id, set_type, set_ca, type_source, ca_source).
|
||||
set_type / set_ca may be None to leave that column alone.
|
||||
In dry-run we still execute UPDATEs inside an outer transaction (rolled back
|
||||
at the end) so subsequent batches' SELECTs see the correct intermediate state."""
|
||||
n = len(candidates)
|
||||
print(f" {label}: {n} rows queued")
|
||||
if n == 0:
|
||||
return 0
|
||||
for c in candidates[:3]:
|
||||
print(f" sample: id={c[0]} type={c[1]!r} ca={c[2]!r} type_src={c[3]} ca_src={c[4]}")
|
||||
n_written = 0
|
||||
for row_id, set_type, set_ca, type_src, ca_src in candidates:
|
||||
meta_patch = {}
|
||||
if type_src:
|
||||
meta_patch["type_source"] = type_src
|
||||
if ca_src:
|
||||
meta_patch["created_at_source"] = ca_src
|
||||
# Build set list dynamically.
|
||||
sets, params = [], []
|
||||
if set_type is not None:
|
||||
sets.append("type = %s")
|
||||
params.append(set_type)
|
||||
if set_ca is not None:
|
||||
sets.append("created_at = %s")
|
||||
params.append(set_ca)
|
||||
if meta_patch:
|
||||
sets.append("metadata = COALESCE(metadata, '{}'::jsonb) || %s::jsonb")
|
||||
params.append(json.dumps(meta_patch))
|
||||
params.append(row_id)
|
||||
cur.execute(f"UPDATE embeddings SET {', '.join(sets)} WHERE id = %s;", params)
|
||||
n_written += cur.rowcount
|
||||
print(f" {n_written} rows updated{' (will rollback)' if not apply_mode else ''}")
|
||||
return n_written
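The dynamic `SET` list in `run_batch` can be isolated as a pure function that returns the SQL text and its parameters, which makes it testable without a cursor. `build_update` is a hypothetical helper name, not part of the script. The early return for an empty change set is also an addition here: the loop above would otherwise format an empty `SET` clause if a candidate carried no type, timestamp, or provenance.

```python
import json


def build_update(row_id, set_type=None, set_ca=None, type_src=None, ca_src=None):
    """Return (sql, params) for one candidate, or None when nothing would change."""
    meta_patch = {}
    if type_src:
        meta_patch["type_source"] = type_src
    if ca_src:
        meta_patch["created_at_source"] = ca_src

    sets, params = [], []
    if set_type is not None:
        sets.append("type = %s")
        params.append(set_type)
    if set_ca is not None:
        sets.append("created_at = %s")
        params.append(set_ca)
    if meta_patch:
        # jsonb concatenation merges provenance keys into existing metadata.
        sets.append("metadata = COALESCE(metadata, '{}'::jsonb) || %s::jsonb")
        params.append(json.dumps(meta_patch))
    if not sets:
        return None
    params.append(row_id)
    return f"UPDATE embeddings SET {', '.join(sets)} WHERE id = %s;", params


sql, params = build_update("a1b2c3d4_0", set_type="document", type_src="inferred_extension")
print(sql)
print(params)
print(build_update("a1b2c3d4_0"))
```

Keeping column values in `params` and interpolating only fixed fragments into the SQL string means row data never reaches the statement text.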
|
||||
|
||||
|
||||
# ─── Batches ────────────────────────────────────────────────────────────────
|
||||
|
||||
def batch_T1_type(cur, apply_mode):
|
||||
"""type IS NULL -> 'document'. All cohort A rows have a SUPPORTED extension."""
|
||||
cur.execute("""
|
||||
SELECT id, source FROM embeddings WHERE type IS NULL ORDER BY id;
|
||||
""")
|
||||
rows = cur.fetchall()
|
||||
cands = [(r["id"], "document", None, "inferred_extension", None) for r in rows]
|
||||
return run_batch(cur, "T1 type IS NULL -> 'document'", cands, apply_mode)
|
||||
|
||||
|
||||
def batch_C1_filepath_stat(cur, apply_mode):
|
||||
"""ca IS NULL AND metadata.filepath stat-resolves -> mtime."""
|
||||
cur.execute("""
|
||||
SELECT id, source, metadata->>'filepath' AS fp
|
||||
FROM embeddings
|
||||
WHERE created_at IS NULL AND metadata->>'filepath' IS NOT NULL
|
||||
ORDER BY id;
|
||||
""")
|
||||
rows = cur.fetchall()
|
||||
cands, n_skipped_missing = [], 0
|
||||
for r in rows:
|
||||
p = Path(r["fp"])
|
||||
if p.exists():
|
||||
mt = fmt_ts_mtime(p)
|
||||
if mt:
|
||||
cands.append((r["id"], None, mt, None, "filepath_stat"))
|
||||
continue
|
||||
n_skipped_missing += 1
|
||||
print(f" C1 candidates: {len(cands)} (skipped {n_skipped_missing} where filepath gone or unstattable)")
|
||||
return run_batch(cur, "C1 ca IS NULL AND filepath stat-resolves -> mtime", cands, apply_mode)
|
||||
|
||||
|
||||
def batch_C2_C3_watcher_state(cur, apply_mode):
|
||||
"""ca IS NULL AND filepath unresolvable -> watcher_state by source basename.
|
||||
C2 = unique hit, C3 = collision pick-latest."""
|
||||
by_name = load_watcher_state()
|
||||
cur.execute("""
|
||||
SELECT id, source, metadata->>'filepath' AS fp
|
||||
FROM embeddings
|
||||
WHERE created_at IS NULL
|
||||
ORDER BY id;
|
||||
""")
|
||||
rows = cur.fetchall()
|
||||
c2, c3 = [], []
|
||||
skipped_no_match = 0
|
||||
for r in rows:
|
||||
# skip rows already targeted by C1 path
|
||||
if r["fp"] and Path(r["fp"]).exists():
|
||||
continue
|
||||
src = r["source"]
|
||||
if not src or src not in by_name:
|
||||
skipped_no_match += 1
|
||||
continue
|
||||
candidates = by_name[src]
|
||||
if len(candidates) == 1:
|
||||
mt = fmt_ts_unix(candidates[0][1])
|
||||
c2.append((r["id"], None, mt, None, "watcher_state_unique"))
|
||||
else:
|
||||
latest = max(candidates, key=lambda x: float(x[1]))
|
||||
mt = fmt_ts_unix(latest[1])
|
||||
c3.append((r["id"], None, mt, None, f"watcher_state_collision_pick_latest_of_{len(candidates)}"))
|
||||
print(f" C2/C3 source-basename fallback: {len(c2)} unique, {len(c3)} collision, "
|
||||
f"{skipped_no_match} unmatched (will fall to C4/C5)")
|
||||
n2 = run_batch(cur, "C2 ca IS NULL AND watcher_state unique -> mtime", c2, apply_mode)
|
||||
n3 = run_batch(cur, "C3 ca IS NULL AND watcher_state collision -> latest mtime", c3, apply_mode)
|
||||
return n2 + n3
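The C2/C3 split comes down to how many watcher paths share a basename. A minimal sketch of that resolution, assuming `watcher_state.json` maps absolute paths to Unix mtimes stored as strings (as the inspection script's helper comment suggests); the paths and timestamps below are invented, and `resolve_created_at` is an illustrative name.

```python
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import PurePosixPath


def index_by_basename(state):
    """Group (path, mtime) pairs by filename so bare `source` values can be looked up."""
    by_name = defaultdict(list)
    for path, mtime in state.items():
        by_name[PurePosixPath(path).name].append((path, mtime))
    return by_name


def to_iso_utc(ts):
    return datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat().replace("+00:00", "Z")


def resolve_created_at(source, by_name):
    """Return (iso_timestamp, provenance_label), or None when no path matches."""
    hits = by_name.get(source)
    if not hits:
        return None
    if len(hits) == 1:
        return to_iso_utc(hits[0][1]), "watcher_state_unique"
    latest = max(hits, key=lambda h: float(h[1]))
    return to_iso_utc(latest[1]), f"watcher_state_collision_pick_latest_of_{len(hits)}"


state = {
    "/files/Notes/plan.md": "100",
    "/files/Archive/plan.md": "200.5",
    "/files/cv.docx": "50",
}
by_name = index_by_basename(state)
for src in ("cv.docx", "plan.md", "missing.txt"):
    print(src, "->", resolve_created_at(src, by_name))
```

The provenance label records how many paths collided, so a later audit can find every row whose timestamp came from a pick-latest guess and revisit it.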
|
||||
|
||||
|
||||
def batch_C4_chatgpt_export(cur, apply_mode):
|
||||
index = load_chatgpt_index()
|
||||
cur.execute("""
|
||||
SELECT id, source FROM embeddings
|
||||
WHERE type='chatgpt_conversation' AND created_at IS NULL ORDER BY id;
|
||||
""")
|
||||
rows = cur.fetchall()
|
||||
cands, unresolved = [], 0
|
||||
for r in rows:
|
||||
m = re.match(r"^chatgpt_(.+)_(\d+)$", r["id"])
|
||||
cid = m.group(1) if m else None
|
||||
ct = index.get(cid)
|
||||
if ct is None:
|
||||
unresolved += 1
|
||||
continue
|
||||
ct_iso = datetime.fromtimestamp(float(ct), tz=timezone.utc).isoformat().replace("+00:00", "Z")
|
||||
cands.append((r["id"], None, ct_iso, None, "chatgpt_export"))
|
||||
print(f" C4 chatgpt export resolution: {len(cands)} resolved, {unresolved} unresolved (fall to C5)")
|
||||
return run_batch(cur, "C4 type='chatgpt_conversation' AND ca IS NULL -> export create_time", cands, apply_mode)
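The C4 lookup depends on recovering the conversation id from the chunk id. The sketch below shows how the greedy pattern behaves when the conversation id itself contains underscores. The ids and the `create_time` value are invented, and `conversation_id` and `created_at_for` are illustrative helper names.

```python
import re
from datetime import datetime, timezone

_CHATGPT_ID_RE = re.compile(r"^chatgpt_(.+)_(\d+)$")


def conversation_id(chunk_id):
    """Strip the 'chatgpt_' prefix and the trailing '_<chunk index>'."""
    m = _CHATGPT_ID_RE.match(chunk_id)
    return m.group(1) if m else None


def created_at_for(chunk_id, index):
    """Map a chunk id to an ISO-8601 UTC timestamp via the export index, else None."""
    ct = index.get(conversation_id(chunk_id))
    if ct is None:
        return None
    return datetime.fromtimestamp(float(ct), tz=timezone.utc).isoformat().replace("+00:00", "Z")


index = {"abc_def": 86400.0}  # conversation id -> create_time (Unix seconds)

print(conversation_id("chatgpt_abc_def_12"))
print(created_at_for("chatgpt_abc_def_12", index))
print(created_at_for("chatgpt_unknown_0", index))
print(created_at_for("notes_3", index))
```

Because `(.+)` is greedy, only the final `_<digits>` group is treated as the chunk index. Ids that do not match the pattern resolve to `None` and, in the script above, fall through to the sentinel batch.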
|
||||
|
||||
|
||||
def batch_C5_sentinel(cur, apply_mode):
|
||||
cur.execute("""
|
||||
SELECT id, type, source FROM embeddings WHERE created_at IS NULL ORDER BY id;
|
||||
""")
|
||||
rows = cur.fetchall()
|
||||
cands = [(r["id"], None, SENTINEL_ISO, None, "sentinel") for r in rows]
|
||||
if cands:
|
||||
sample_types = Counter(r["type"] for r in rows)
|
||||
print(f" C5 residual sentinel rows by type: {dict(sample_types)}")
|
||||
return run_batch(cur, f"C5 ca IS NULL residual -> sentinel {SENTINEL_ISO}", cands, apply_mode)
|
||||
|
||||
|
||||
# ─── Pre/post counts ────────────────────────────────────────────────────────
|
||||
|
||||
def print_counts(cur, label):
|
||||
cur.execute("""
|
||||
SELECT
|
||||
COUNT(*) AS total,
|
||||
COUNT(*) FILTER (WHERE type IS NULL) AS type_null,
|
||||
COUNT(*) FILTER (WHERE created_at IS NULL) AS ca_null
|
||||
FROM embeddings;
|
||||
""")
|
||||
r = cur.fetchone()
|
||||
print(f" [{label}] total={r['total']} type_null={r['type_null']} ca_null={r['ca_null']}")
|
||||
|
||||
|
||||
# ─── Driver ─────────────────────────────────────────────────────────────────
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
ap.add_argument("--apply", action="store_true", help="default false (dry-run)")
|
||||
args = ap.parse_args()
|
||||
apply_mode = args.apply
|
||||
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
|
||||
print(f"Mode: {'APPLY (writes will commit)' if apply_mode else 'DRY-RUN (no writes)'}")
|
||||
print(f"Sentinel: {SENTINEL_ISO}")
|
||||
|
||||
if apply_mode:
|
||||
assert_snapshot(cur)
|
||||
|
||||
header("PRE-COUNTS")
|
||||
print_counts(cur, "before")
|
||||
|
||||
header("BATCHES")
|
||||
n_t1 = batch_T1_type(cur, apply_mode)
|
||||
n_c1 = batch_C1_filepath_stat(cur, apply_mode)
|
||||
n_c2c3 = batch_C2_C3_watcher_state(cur, apply_mode)
|
||||
n_c4 = batch_C4_chatgpt_export(cur, apply_mode)
|
||||
n_c5 = batch_C5_sentinel(cur, apply_mode)
|
||||
|
||||
header("POST-COUNTS")
|
||||
print_counts(cur, "after" if apply_mode else "after (in-transaction, will rollback)")
|
||||
|
||||
if apply_mode:
|
||||
pg.commit()
|
||||
print("\nCOMMITTED.")
|
||||
else:
|
||||
pg.rollback()
|
||||
print("\nROLLED BACK (dry-run).")
|
||||
|
||||
print(f"\nSummary: T1={n_t1} C1={n_c1} C2+C3={n_c2c3} C4={n_c4} C5={n_c5}")
|
||||
pg.close()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,557 @@
|
||||
"""Read-only inspection for the embeddings.type / embeddings.created_at backfill (Improvement #2 / A.1).
|
||||
|
||||
Produces a survey of every backfill source-of-truth question without writing
|
||||
to the database. Output is a human-readable report on stdout plus a JSON
|
||||
sidecar at experiments/embeddings_backfill_inspection_<date>.json.
|
||||
|
||||
Sections:
|
||||
1. Cohort recap (counts; should match prior investigation).
|
||||
2. Cohort A type inference: extension classifier coverage.
|
||||
3. created_at inference for cohort A + B-doc-old:
|
||||
- rows with metadata.filepath: stat the file, check existence.
|
||||
- rows without filepath: lookup source against watcher_state.json.
|
||||
- filename-collision shape audit (live+backup, live+archive, ambiguous).
|
||||
4. ChatGPT export resolution (Plan A.1 addition #1):
|
||||
- existence of /home/aaron/nextcloud/.../ChatGPT Export/.
|
||||
- sample 5 B-chatgpt rows; resolve convo_id -> create_time.
|
||||
5. Sentinel date discovery (Plan A.1 addition #3):
|
||||
- earliest non-NULL created_at per type (already-populated rows are the
|
||||
lower bound for when the substrate started carrying timestamps).
|
||||
- git log for the pgvector migration commit.
|
||||
- any ChromaDB sqlite still on disk.
|
||||
- propose a sentinel with reasoning, or flag as arbitrary.
|
||||
6. 50-row stratified sample: derived (type, created_at, source) per row.
|
||||
|
||||
Usage: venv/bin/python3 scripts/experiments/embeddings_backfill_inspection.py
|
||||
|
||||
Read-only. No DB writes. No filesystem writes outside experiments/.
|
||||
"""
|
||||
import json
|
||||
import os
|
||||
import random
|
||||
import re
|
||||
import subprocess
|
||||
import sys
|
||||
from collections import Counter, defaultdict
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
import psycopg2
|
||||
from psycopg2.extras import RealDictCursor
|
||||
from dotenv import load_dotenv
|
||||
|
||||
load_dotenv(Path.home() / "aaronai" / ".env")
|
||||
|
||||
PG_DSN = os.getenv("PG_DSN")
|
||||
WATCHER_STATE = Path.home() / "aaronai" / "watcher_state.json"
|
||||
CHATGPT_EXPORT_DIR = Path("/home/aaron/nextcloud/data/data/aaron/files/Archive/Misc/ChatGPT Export")
|
||||
NEXTCLOUD_ROOT = Path("/home/aaron/nextcloud/data/data/aaron/files")
|
||||
OUT_PATH = Path.home() / "aaronai" / "experiments" / f"embeddings_backfill_inspection_{datetime.now().strftime('%Y-%m-%d')}.json"
|
||||
|
||||
SUPPORTED_EXT = {".pdf", ".docx", ".pptx", ".txt", ".md"}
|
||||
|
||||
random.seed(20260503)
|
||||
|
||||
|
||||
# ─── Helpers ────────────────────────────────────────────────────────────────
|
||||
|
||||
def get_pg():
|
||||
return psycopg2.connect(PG_DSN, cursor_factory=RealDictCursor)
|
||||
|
||||
|
||||
def header(title):
|
||||
bar = "=" * 70
|
||||
print(f"\n{bar}\n{title}\n{bar}")
|
||||
|
||||
|
||||
def sub(title):
|
||||
print(f"\n--- {title} ---")
|
||||
|
||||
|
||||
def fmt_ts_from_unix(ts):
|
||||
"""Watcher state stores unix timestamps as strings."""
|
||||
try:
|
||||
return datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat().replace("+00:00", "Z")
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
|
||||
def fmt_ts_from_st_mtime(p):
|
||||
try:
|
||||
return datetime.fromtimestamp(p.stat().st_mtime, tz=timezone.utc).isoformat().replace("+00:00", "Z")
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
|
||||
def load_watcher_state():
|
||||
"""Returns (path -> mtime_str), and (basename -> [(path, mtime_str), ...])."""
|
||||
state = json.loads(WATCHER_STATE.read_text())
|
||||
by_path = state
|
||||
by_name = defaultdict(list)
|
||||
for path, mtime in state.items():
|
||||
by_name[Path(path).name].append((path, mtime))
|
||||
return by_path, by_name
|
||||
|
||||
|
||||
def classify_collision_shape(paths):
    """Categorize a filename-collision group:
      - 'live+backup'           : exactly one live path; the rest are backup-like only
      - 'live+archive'          : exactly one live path; the rest are inside Archive/ only
      - 'live+mixed-old'        : exactly one live path; the rest mix backup and Archive/
      - 'multi-live'            : >=2 paths look live (no backup/archive markers)
      - 'all-archive-or-backup' : no live path; every path is in Archive/ or backup-like
      - 'other'                 : a single live path with no backup or archive siblings
    """
    def is_backup(p):
        s = p.lower()
        return ".bak" in s or "/backup" in s or "backups/" in s

    def is_archive(p):
        s = p.lower()
        return "/archive/" in s

    backups = [p for p in paths if is_backup(p)]
    archives = [p for p in paths if is_archive(p)]
    live = [p for p in paths if not is_backup(p) and not is_archive(p)]
    if len(live) == 1 and len(backups) >= 1 and len(archives) == 0:
        return "live+backup"
    if len(live) == 1 and len(archives) >= 1 and len(backups) == 0:
        return "live+archive"
    if len(live) == 1 and (len(backups) + len(archives)) >= 1:
        return "live+mixed-old"
    if len(live) >= 2:
        return "multi-live"
    if len(live) == 0:
        return "all-archive-or-backup"
    return "other"
|
||||
|
||||
|
||||
# ─── Section 1: Cohort recap ────────────────────────────────────────────────
|
||||
|
||||
def section_1_cohort_recap(cur):
|
||||
header("1. COHORT RECAP")
|
||||
cur.execute("""
|
||||
SELECT
|
||||
COUNT(*) AS total,
|
||||
COUNT(*) FILTER (WHERE type IS NULL) AS type_null,
|
||||
COUNT(*) FILTER (WHERE created_at IS NULL) AS ca_null,
|
||||
COUNT(*) FILTER (WHERE type IS NULL AND created_at IS NULL) AS both_null,
|
||||
COUNT(*) FILTER (WHERE type IS NOT NULL AND created_at IS NOT NULL) AS both_set
|
||||
FROM embeddings;
|
||||
""")
|
||||
overall = cur.fetchone()
|
||||
print(f"Total: {overall['total']} type_null: {overall['type_null']} "
|
||||
f"ca_null: {overall['ca_null']} both_null: {overall['both_null']} "
|
||||
f"both_set: {overall['both_set']}")
|
||||
|
||||
cur.execute("""
|
||||
SELECT type, created_at IS NULL AS ca_null, COUNT(*) AS n
|
||||
FROM embeddings GROUP BY type, ca_null ORDER BY type NULLS LAST, ca_null;
|
||||
""")
|
||||
cohorts = cur.fetchall()
|
||||
sub("Per-(type, ca_null) cohorts")
|
||||
for r in cohorts:
|
||||
print(f" type={r['type'] or 'NULL':<22} ca_null={r['ca_null']!s:<5} n={r['n']}")
|
||||
return {"overall": overall, "cohorts": cohorts}
|
||||
|
||||
|
||||
# ─── Section 2: Cohort A type inference ─────────────────────────────────────
|
||||
|
||||
def section_2_type_inference(cur):
|
||||
header("2. COHORT A TYPE INFERENCE (extension classifier)")
|
||||
cur.execute("""
|
||||
SELECT LOWER(SUBSTRING(source FROM '\\.[^.]+$')) AS ext, COUNT(*) AS rows
|
||||
FROM embeddings WHERE type IS NULL
|
||||
GROUP BY ext ORDER BY rows DESC;
|
||||
""")
|
||||
by_ext = cur.fetchall()
|
||||
classified = sum(r["rows"] for r in by_ext if r["ext"] in SUPPORTED_EXT)
|
||||
unknown = sum(r["rows"] for r in by_ext if r["ext"] not in SUPPORTED_EXT)
|
||||
print(f"NULL-type rows by extension:")
|
||||
for r in by_ext:
|
||||
flag = "OK" if r["ext"] in SUPPORTED_EXT else "??"
|
||||
print(f" {flag} {r['ext'] or '(none)':<8} rows={r['rows']}")
|
||||
print(f"\nClassified as 'document' via extension: {classified}")
|
||||
print(f"Unclassifiable (no SUPPORTED extension): {unknown}")
|
||||
return {"by_ext": by_ext, "classified": classified, "unclassifiable": unknown}
|
||||
|
||||
|
||||
# ─── Section 3: created_at inference ────────────────────────────────────────
|
||||
|
||||
def section_3_created_at_inference(cur):
|
||||
header("3. CREATED_AT INFERENCE — file-derived rows")
|
||||
by_path, by_name = load_watcher_state()
|
||||
print(f"watcher_state.json: {len(by_path)} tracked paths, "
|
||||
f"{len(by_name)} distinct filenames, "
|
||||
f"{sum(1 for v in by_name.values() if len(v) > 1)} filename collisions")
|
||||
|
||||
# 3a. Rows with metadata.filepath: probe stat()
|
||||
sub("3a. Rows with metadata.filepath — stat probe")
|
||||
cur.execute("""
|
||||
SELECT id, source, metadata->>'filepath' AS filepath
|
||||
FROM embeddings
|
||||
WHERE created_at IS NULL AND metadata->>'filepath' IS NOT NULL;
|
||||
""")
|
||||
rows_with_fp = cur.fetchall()
|
||||
fp_exists = 0
|
||||
fp_missing = 0
|
||||
fp_outside_root = 0
|
||||
sample_resolved = []
|
||||
for r in rows_with_fp:
|
||||
p = Path(r["filepath"])
|
||||
if not p.is_relative_to(NEXTCLOUD_ROOT):
|
||||
fp_outside_root += 1
|
||||
if p.exists():
|
||||
fp_exists += 1
|
||||
if len(sample_resolved) < 5:
|
||||
sample_resolved.append({
|
||||
"id": r["id"], "source": r["source"],
|
||||
"filepath": str(p), "mtime": fmt_ts_from_st_mtime(p),
|
||||
})
|
||||
else:
|
||||
fp_missing += 1
|
||||
print(f" rows with metadata.filepath: {len(rows_with_fp)}")
|
||||
print(f" exists on disk: {fp_exists}")
|
||||
print(f" missing on disk: {fp_missing}")
|
||||
print(f" outside Nextcloud root: {fp_outside_root}")
|
||||
print(f" Sample of 5 resolved mtimes:")
|
||||
for s in sample_resolved:
|
||||
print(f" {s['id']:<15} {s['source'][:60]:<60} mtime={s['mtime']}")
|
||||
|
||||
# 3b. Rows without metadata.filepath: watcher_state lookup
|
||||
sub("3b. Rows without metadata.filepath — watcher_state lookup")
|
||||
cur.execute("""
|
||||
SELECT id, source FROM embeddings
WHERE created_at IS NULL
AND metadata->>'filepath' IS NULL
AND (type IS NULL OR type = 'document');
|
||||
""")
|
||||
rows_no_fp = cur.fetchall()
|
||||
# Distinct source basenames to look up
|
||||
basenames_to_resolve = sorted({r["source"] for r in rows_no_fp if r["source"]})
|
||||
n_resolved_unique = sum(1 for n in basenames_to_resolve if len(by_name.get(n, [])) == 1)
|
||||
n_collision_unique = sum(1 for n in basenames_to_resolve if len(by_name.get(n, [])) > 1)
|
||||
n_unfound = sum(1 for n in basenames_to_resolve if n not in by_name)
|
||||
print(f" rows without filepath: {len(rows_no_fp)}")
|
||||
print(f" distinct source basenames to resolve: {len(basenames_to_resolve)}")
|
||||
print(f" unique watcher_state hit (no collision): {n_resolved_unique}")
|
||||
print(f" collision in watcher_state (>1 path): {n_collision_unique}")
|
||||
print(f" not in watcher_state at all: {n_unfound}")
|
||||
|
||||
# 3c. Collision-shape audit
|
||||
sub("3c. Collision-shape audit — all collisions in watcher_state")
|
||||
collisions = {n: [(p, m) for p, m in by_name[n]] for n in by_name if len(by_name[n]) > 1}
|
||||
shape_counts = Counter()
|
||||
rows_affected_by_shape = Counter()
|
||||
# Map from basename to count of NULL-ca rows that need it (rows_no_fp)
|
||||
rows_no_fp_by_name = Counter(r["source"] for r in rows_no_fp)
|
||||
sample_per_shape = defaultdict(list)
|
||||
for name, paths_mtimes in collisions.items():
|
||||
paths = [p for p, _ in paths_mtimes]
|
||||
shape = classify_collision_shape(paths)
|
||||
shape_counts[shape] += 1
|
||||
rows_affected_by_shape[shape] += rows_no_fp_by_name.get(name, 0)
|
||||
if len(sample_per_shape[shape]) < 3:
|
||||
entry = {
|
||||
"name": name,
|
||||
"rows_no_fp_using_this_name": rows_no_fp_by_name.get(name, 0),
|
||||
"candidates": [
|
||||
{"path": p, "mtime": fmt_ts_from_unix(m)}
|
||||
for p, m in sorted(paths_mtimes, key=lambda x: -float(x[1]))
|
||||
],
|
||||
}
|
||||
sample_per_shape[shape].append(entry)
|
||||
print(f" collisions in watcher_state: {len(collisions)}")
|
||||
print(f" shape breakdown:")
|
||||
for shape, n in shape_counts.most_common():
|
||||
print(f" {shape:<22} collisions={n:<4} rows_affected={rows_affected_by_shape[shape]}")
|
||||
print(f"\n Up-to-3 sample collisions per shape (sorted by mtime desc):")
|
||||
for shape, samples in sample_per_shape.items():
|
||||
print(f" [{shape}]")
|
||||
for s in samples:
|
||||
print(f" {s['name']} (rows_no_fp using this name: {s['rows_no_fp_using_this_name']})")
|
||||
for c in s["candidates"]:
|
||||
print(f" {c['mtime']} {c['path']}")
|
||||
|
||||
return {
|
||||
"watcher_state_paths": len(by_path),
|
||||
"watcher_state_basenames": len(by_name),
|
||||
"watcher_state_collisions": len(collisions),
|
||||
"rows_with_filepath": {
|
||||
"total": len(rows_with_fp),
|
||||
"exists": fp_exists, "missing": fp_missing,
|
||||
"outside_root": fp_outside_root,
|
||||
"sample": sample_resolved,
|
||||
},
|
||||
"rows_without_filepath": {
|
||||
"total": len(rows_no_fp),
|
||||
"distinct_basenames": len(basenames_to_resolve),
|
||||
"unique_hit": n_resolved_unique,
|
||||
"collision_hit": n_collision_unique,
|
||||
"unfound": n_unfound,
|
||||
},
|
||||
"collision_shapes": {
|
||||
"total": len(collisions),
|
||||
"shape_counts": dict(shape_counts),
|
||||
"rows_affected_by_shape": dict(rows_affected_by_shape),
|
||||
"samples": {k: v for k, v in sample_per_shape.items()},
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
# ─── Section 4: ChatGPT export resolution ───────────────────────────────────
|
||||
|
||||
def section_4_chatgpt_export(cur):
|
||||
header("4. CHATGPT EXPORT RESOLUTION (Plan addition #1)")
|
||||
print(f"Probing: {CHATGPT_EXPORT_DIR}")
|
||||
if not CHATGPT_EXPORT_DIR.exists():
|
||||
print(" NOT FOUND — plan on sentinel for entire B-chatgpt cohort.")
|
||||
return {"export_dir_exists": False, "files": []}
|
||||
files = sorted(CHATGPT_EXPORT_DIR.glob("conversations*.json"))
|
||||
print(f" found {len(files)} export file(s):")
|
||||
for f in files:
|
||||
print(f" {f.name} size={f.stat().st_size:,} mtime={fmt_ts_from_st_mtime(f)}")
|
||||
|
||||
# Build convo_id -> create_time index from all export files.
|
||||
print("\nLoading export(s) to build convo_id -> create_time index...")
|
||||
convo_index = {}
|
||||
for f in files:
|
||||
try:
|
||||
data = json.loads(f.read_text(encoding="utf-8"))
|
||||
except Exception as e:
|
||||
print(f" failed to parse {f.name}: {e}")
|
||||
continue
|
||||
for convo in data:
|
||||
cid = convo.get("id") or convo.get("conversation_id")
|
||||
ct = convo.get("create_time")
|
||||
if cid and ct is not None:
|
||||
convo_index[cid] = ct
|
||||
print(f" indexed {len(convo_index)} conversations across {len(files)} export files")
|
||||
|
||||
# Sample 5 chatgpt_conversation rows; resolve.
|
||||
cur.execute("""
|
||||
SELECT id, source FROM embeddings
|
||||
WHERE type='chatgpt_conversation' AND created_at IS NULL
|
||||
ORDER BY random() LIMIT 5;
|
||||
""")
|
||||
sample = cur.fetchall()
|
||||
sub("Sample of 5 B-chatgpt rows: convo lookup")
|
||||
resolved = 0
|
||||
sample_results = []
|
||||
for r in sample:
|
||||
# IDs look like chatgpt_<uuid>_<idx>; uuid extends until last underscore.
|
||||
m = re.match(r"^chatgpt_(.+)_(\d+)$", r["id"])
|
||||
cid = m.group(1) if m else None
|
||||
ct = convo_index.get(cid)
|
||||
ct_iso = None
|
||||
if ct is not None:
|
||||
try:
|
||||
ct_iso = datetime.fromtimestamp(float(ct), tz=timezone.utc).isoformat().replace("+00:00", "Z")
|
||||
except Exception:
|
||||
ct_iso = None
|
||||
if ct_iso:
|
||||
resolved += 1
|
||||
sample_results.append({
|
||||
"id": r["id"], "source": r["source"], "convo_id": cid,
|
||||
"create_time": ct, "create_time_iso": ct_iso,
|
||||
"resolved": ct_iso is not None,
|
||||
})
|
||||
print(f" {r['id']} cid={cid}")
|
||||
print(f" -> create_time={ct} iso={ct_iso}")
|
||||
print(f"\nResolved {resolved}/{len(sample)}. "
f"{'PROCEED with re-derive for full cohort.' if sample and resolved == len(sample) else 'PARTIAL — plan re-derive + sentinel for unresolved.'}")
|
||||
|
||||
# Estimate full-cohort coverage by counting how many B-chatgpt convo_ids appear in the index.
|
||||
cur.execute("""
|
||||
SELECT DISTINCT regexp_replace(id, '^chatgpt_(.+)_\\d+$', '\\1') AS cid
|
||||
FROM embeddings WHERE type='chatgpt_conversation' AND created_at IS NULL;
|
||||
""")
|
||||
distinct_cids = [r["cid"] for r in cur.fetchall()]
|
||||
in_index = sum(1 for c in distinct_cids if c in convo_index)
|
||||
print(f"Full-cohort coverage estimate: {in_index} / {len(distinct_cids)} distinct convo_ids "
|
||||
f"resolvable from export.")
|
||||
return {
|
||||
"export_dir_exists": True,
|
||||
"files": [{"name": f.name, "size": f.stat().st_size, "mtime": fmt_ts_from_st_mtime(f)} for f in files],
|
||||
"convo_index_size": len(convo_index),
|
||||
"sample_results": sample_results,
|
||||
"sample_resolved": resolved,
|
||||
"full_cohort": {
|
||||
"distinct_convo_ids": len(distinct_cids),
|
||||
"resolvable_from_export": in_index,
|
||||
"unresolvable": len(distinct_cids) - in_index,
|
||||
},
|
||||
}
|
||||
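The id parsing in section 4 relies on the greedy `(.+)` backing off only as far as the last underscore, so a conversation id that itself contains underscores still parses. A minimal sketch, assuming the `chatgpt_<conversation id>_<chunk index>` shape described in the comment above; the helper name `split_chunk_id` and the sample ids are hypothetical.

```python
import re

# Assumed id shape: chatgpt_<conversation id>_<chunk index>.
ID_RE = re.compile(r"^chatgpt_(.+)_(\d+)$")


def split_chunk_id(row_id):
    """Return (conversation_id, chunk_index), or (None, None) if the id does not match."""
    m = ID_RE.match(row_id)
    if not m:
        return None, None
    return m.group(1), int(m.group(2))


# Hypothetical ids, for illustration only.
print(split_chunk_id("chatgpt_abc_def_12"))
print(split_chunk_id("claude_abc_3"))
```

The same pattern is what the `regexp_replace` call in the coverage query applies on the database side, so the two should agree on which part of the id is the conversation id.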
|
||||
|
||||
# ─── Section 5: Sentinel date discovery ─────────────────────────────────────
|
||||
|
||||
def section_5_sentinel(cur):
|
||||
header("5. SENTINEL DATE DISCOVERY (Plan addition #3)")
|
||||
|
||||
# 5a. Earliest non-NULL created_at per type: lower bound on substrate age.
|
||||
sub("5a. Earliest non-NULL created_at per type")
|
||||
cur.execute("""
|
||||
SELECT type, MIN(created_at) AS earliest, MAX(created_at) AS latest, COUNT(*) AS rows
|
||||
FROM embeddings WHERE created_at IS NOT NULL GROUP BY type ORDER BY type;
|
||||
""")
|
||||
rows = cur.fetchall()
|
||||
for r in rows:
|
||||
print(f" {r['type'] or 'NULL':<22} earliest={r['earliest']!s:<32} latest={r['latest']}")
|
||||
|
||||
# 5b. git log for the pgvector-migration commit.
|
||||
sub("5b. Git log — pgvector migration commits")
|
||||
git_findings = []
|
||||
try:
|
||||
out = subprocess.run(
|
||||
["git", "log", "--all", "--format=%H %ci %s",
|
||||
"--", "deprecated/migrate_to_pgvector.py", "scripts/migrate_to_pgvector.py"],
|
||||
cwd=str(Path.home() / "aaronai"), capture_output=True, text=True, timeout=10,
|
||||
)
|
||||
for line in out.stdout.strip().splitlines():
|
||||
print(f" {line}")
|
||||
git_findings.append(line)
|
||||
except Exception as e:
|
||||
print(f" git log failed: {e}")
|
||||
# Also: when did the api/ingest scripts cut over to pgvector?
|
||||
try:
|
||||
out = subprocess.run(
|
||||
["git", "log", "--all", "--format=%H %ci %s", "--grep=pgvector", "-i"],
|
||||
cwd=str(Path.home() / "aaronai"), capture_output=True, text=True, timeout=10,
|
||||
)
|
||||
print("\n Commits mentioning pgvector:")
|
||||
for line in out.stdout.strip().splitlines()[:10]:
|
||||
print(f" {line}")
|
||||
git_findings.append(line)
|
||||
except Exception as e:
|
||||
print(f" git log (pgvector grep) failed: {e}")
|
||||
|
||||
# 5c. ChromaDB sqlite still on disk?
|
||||
sub("5c. ChromaDB dump on disk?")
|
||||
candidates = []
|
||||
for root in [Path.home() / "aaronai", Path.home() / "aaronai" / "db"]:
|
||||
if root.exists():
|
||||
for p in root.rglob("chroma*.sqlite*"):
|
||||
candidates.append({"path": str(p), "mtime": fmt_ts_from_st_mtime(p)})
|
||||
if candidates:
|
||||
for c in candidates:
|
||||
print(f" found: {c['path']} mtime={c['mtime']}")
|
||||
else:
|
||||
print(" no ChromaDB sqlite found under ~/aaronai")
|
||||
|
||||
# 5d. Propose sentinel.
|
||||
sub("5d. Sentinel proposal")
|
||||
# Earliest doc cutover: per query, document=2026-04-30. Migration commit f78b830 was
|
||||
# 2026-04-26. Most defensible sentinel for "rows that entered pgvector before NOW()
|
||||
# writes were canonical" = the migration commit date.
|
||||
proposed = "2026-04-26T00:00:00Z"
|
||||
reasoning = (
|
||||
"git f78b830 'Migrate to pgvector — remove ChromaDB from api.py, ingest scripts, "
|
||||
"dream.py' is dated 2026-04-26. The earliest type='document' row with a non-NULL "
|
||||
"created_at lands 2026-04-30 (the F11 canonical-encoding cutover). Rows with NULL "
|
||||
"created_at all predate F11 and most predate the pgvector cutover itself. "
|
||||
"2026-04-26 is the date the ChromaDB->pgvector migration script was committed, "
|
||||
"so any row currently in the embeddings table with NULL created_at must have been "
|
||||
"ingested on or after that date (when the table came into existence in current form). "
"It is the tightest defensible lower bound on when 'the row entered pgvector before "
|
||||
"timestamps were tracked', so it is the right sentinel."
|
||||
)
|
||||
print(f" Proposed sentinel: {proposed}")
|
||||
print(f" Reasoning: {reasoning}")
|
||||
|
||||
return {
|
||||
"earliest_per_type": rows,
|
||||
"git_findings": git_findings,
|
||||
"chromadb_candidates": candidates,
|
||||
"proposed_sentinel": proposed,
|
||||
"reasoning": reasoning,
|
||||
}
|
||||
|
||||
|
||||
# ─── Section 6: 50-row stratified sample ────────────────────────────────────
|
||||
|
||||
def section_6_stratified_sample(cur, sentinel_iso):
|
||||
header("6. 50-ROW STRATIFIED SAMPLE — derived (type, created_at, source)")
|
||||
by_path, by_name = load_watcher_state()
|
||||
|
||||
cohorts = [
|
||||
("A (type NULL, ca NULL)", "type IS NULL AND created_at IS NULL", 10),
|
||||
("B-doc-old (type='document', ca NULL)", "type='document' AND created_at IS NULL", 10),
|
||||
("B-chatgpt (type='chatgpt_conversation', ca NULL)", "type='chatgpt_conversation' AND created_at IS NULL", 10),
|
||||
("C-doc-new (type='document', ca set)", "type='document' AND created_at IS NOT NULL", 10),
|
||||
("C-claude (type='claude_conversation', ca set)", "type='claude_conversation' AND created_at IS NOT NULL", 5),
|
||||
("C-aaronai (type='aaronai_conversation', ca set)", "type='aaronai_conversation' AND created_at IS NOT NULL", 5),
|
||||
]
|
||||
|
||||
samples = []
|
||||
for label, predicate, n in cohorts:
|
||||
sub(f"{label} (sample size: {n})")
|
||||
cur.execute(f"""
|
||||
SELECT id, source, type, created_at, metadata
|
||||
FROM embeddings WHERE {predicate}
|
||||
ORDER BY random() LIMIT %s;
|
||||
""", (n,))
|
||||
rows = cur.fetchall()
|
||||
for r in rows:
|
||||
row_meta = r["metadata"] or {}
|
||||
fp = row_meta.get("filepath")
|
||||
inferred_type = r["type"] or ("document" if (r["source"] or "").lower().endswith(tuple(SUPPORTED_EXT)) else "?")
|
||||
inferred_ca = r["created_at"]
|
||||
inferred_ca_source = "preserved" if inferred_ca else None
|
||||
if not inferred_ca:
|
||||
if fp and Path(fp).exists():
|
||||
inferred_ca = fmt_ts_from_st_mtime(Path(fp))
|
||||
inferred_ca_source = "filepath_stat"
|
||||
elif r["source"] and r["source"] in by_name:
|
||||
candidates = by_name[r["source"]]
|
||||
if len(candidates) == 1:
|
||||
inferred_ca = fmt_ts_from_unix(candidates[0][1])
|
||||
inferred_ca_source = "watcher_state_unique"
|
||||
else:
|
||||
# take most recent
|
||||
latest = max(candidates, key=lambda x: float(x[1]))
|
||||
inferred_ca = fmt_ts_from_unix(latest[1])
|
||||
inferred_ca_source = f"watcher_state_collision_pick_latest_of_{len(candidates)}"
|
||||
else:
|
||||
inferred_ca = sentinel_iso
|
||||
inferred_ca_source = "sentinel"
|
||||
print(f" id={r['id']:<22} src={(r['source'] or '')[:38]:<38}")
|
||||
print(f" existing: type={r['type']!r:<22} ca={r['created_at']!r}")
|
||||
print(f" inferred: type={inferred_type!r:<22} ca={inferred_ca!r} ({inferred_ca_source})")
|
||||
samples.append({
|
||||
"cohort": label, "id": r["id"], "source": r["source"],
|
||||
"existing_type": r["type"], "existing_ca": r["created_at"],
|
||||
"inferred_type": inferred_type, "inferred_ca": inferred_ca,
|
||||
"inferred_ca_source": inferred_ca_source,
|
||||
})
|
||||
return samples
|
||||
|
||||
|
||||
# ─── Driver ─────────────────────────────────────────────────────────────────
|
||||
|
||||
def main():
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
|
||||
out = {"generated_at": datetime.now(timezone.utc).isoformat()}
|
||||
out["section_1"] = section_1_cohort_recap(cur)
|
||||
out["section_2"] = section_2_type_inference(cur)
|
||||
out["section_3"] = section_3_created_at_inference(cur)
|
||||
out["section_4"] = section_4_chatgpt_export(cur)
|
||||
out["section_5"] = section_5_sentinel(cur)
|
||||
sentinel_iso = out["section_5"]["proposed_sentinel"]
|
||||
out["section_6"] = section_6_stratified_sample(cur, sentinel_iso)
|
||||
|
||||
pg.close()
|
||||
|
||||
# JSON sidecar — strip non-serializables.
|
||||
def _serialize(o):
|
||||
if isinstance(o, datetime):
|
||||
return o.isoformat()
|
||||
return str(o)
|
||||
|
||||
OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
|
||||
OUT_PATH.write_text(json.dumps(out, indent=2, default=_serialize))
|
||||
print(f"\nJSON sidecar written: {OUT_PATH}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
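The precedence that section 6 applies when deriving `created_at` can be read as a small pure function. The sketch below is a simplification under stated assumptions: it takes the stat result and the watcher candidates as pre-computed arguments and returns raw values, whereas the script probes the filesystem and formats timestamps itself. The function name `infer_created_at` and its parameters are hypothetical.

```python
def infer_created_at(existing, filepath_mtime, watcher_candidates, sentinel):
    """Return (value, provenance) following the section 6 precedence.

    existing           -- the row's current created_at, or None
    filepath_mtime     -- timestamp obtained from stat() on metadata.filepath, or None
    watcher_candidates -- list of (path, unix_mtime_str) sharing the row's basename
    sentinel           -- fallback value when nothing else resolves
    """
    if existing:
        return existing, "preserved"
    if filepath_mtime:
        return filepath_mtime, "filepath_stat"
    if len(watcher_candidates) == 1:
        return watcher_candidates[0][1], "watcher_state_unique"
    if watcher_candidates:
        # Collision: several tracked paths share the basename; take the newest.
        latest = max(watcher_candidates, key=lambda c: float(c[1]))
        return latest[1], f"watcher_state_collision_pick_latest_of_{len(watcher_candidates)}"
    return sentinel, "sentinel"


print(infer_created_at(None, None, [("a", "10"), ("b", "20")], "S"))
```

Keeping the provenance string next to the value makes it possible to audit later how many rows ended up on each fallback, in particular on the sentinel.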
@@ -0,0 +1,296 @@
|
||||
"""Read-only analysis of Stage 2 frame data via stage2_frames_v.
|
||||
|
||||
Produces seven sections (frequency, hygiene, per-doc count, co-occurrence,
|
||||
folder cross-tab, worker-version split, data-gap accounting) and writes a JSON
|
||||
sidecar for diffing across runs.
|
||||
|
||||
Usage: venv/bin/python3 scripts/experiments/frame_distribution_report.py
|
||||
"""
|
||||
import os
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
from collections import Counter, defaultdict
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
|
||||
import psycopg2
|
||||
from dotenv import load_dotenv
|
||||
|
||||
load_dotenv()
|
||||
|
||||
OUT_PATH = Path.home() / "aaronai" / "experiments" / f"frame_distribution_{datetime.now().strftime('%Y-%m-%d')}.json"
|
||||
TOP_K = 20 # for co-occurrence; revisit after seeing the long tail
|
||||
|
||||
|
||||
def normalize(label):
    return re.sub(r"\s+", " ", label.strip().lower().replace("_", " "))


def folder_bin(source):
    """Classify source by type. stage_3_queue stores bare filenames, so we
    bin by what kind of file it is, not where it lives in the tree."""
    if not source:
        return "unknown"
    if re.match(r"^(Claude|ChatGPT|Aaron AI):", source):
        return "conversation"  # bypasses Stage 2/3, will not appear here
    s = source.lower()
    if re.search(r"\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-voice\.md$", s):
        return "voice_note"
    if re.search(r"\d{4}-\d{2}-\d{2}-(nrem|early-rem|late-rem|synthesis|lucid)", s):
        return "dream_output"
    if s.endswith(".md"):
        return "markdown"
    if s.endswith(".pdf"):
        return "pdf"
    if s.endswith(".docx") or s.endswith(".doc"):
        return "docx"
    if s.endswith(".pptx") or s.endswith(".ppt"):
        return "pptx"
    if s.endswith(".txt"):
        return "txt"
    return "other"
|
||||
|
||||
|
||||
def fetch_rows(cur):
|
||||
cur.execute("""
|
||||
SELECT source, char_length, active_frames, worker_version, raw_metadata
|
||||
FROM stage2_frames_v
|
||||
""")
|
||||
rows = []
|
||||
for source, char_length, frames, worker_version, raw in cur.fetchall():
|
||||
if not isinstance(frames, list):
|
||||
continue
|
||||
rows.append({
|
||||
"source": source,
|
||||
"char_length": char_length,
|
||||
"frames": [str(f) for f in frames if f],
|
||||
"worker_version": worker_version,
|
||||
"raw_keys": sorted(raw.keys()) if isinstance(raw, dict) else [],
|
||||
})
|
||||
return rows
|
||||
|
||||
|
||||
def section_frequency(rows):
|
||||
counter = Counter()
|
||||
for r in rows:
|
||||
for f in r["frames"]:
|
||||
counter[f] += 1
|
||||
return counter
|
||||
|
||||
|
||||
def section_hygiene(frequency):
|
||||
"""Group raw labels by normalized form; flag collisions."""
|
||||
groups = defaultdict(list)
|
||||
for raw, count in frequency.items():
|
||||
groups[normalize(raw)].append((raw, count))
|
||||
collisions = {k: v for k, v in groups.items() if len(v) > 1}
|
||||
return collisions
|
||||
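A small worked example of the hygiene check may help: labels that differ only in case, underscores, or whitespace collapse to one normalized key, and any key with more than one raw variant is reported. The label counts below are made up for illustration.

```python
import re
from collections import Counter, defaultdict


def normalize(label):
    """Lowercase, turn underscores into spaces, collapse runs of whitespace."""
    return re.sub(r"\s+", " ", label.strip().lower().replace("_", " "))


# Hypothetical raw label frequencies.
frequency = Counter({
    "Systems Thinking": 4,
    "systems_thinking": 2,
    "systems  thinking": 1,
    "craft": 3,
})

groups = defaultdict(list)
for raw, count in frequency.items():
    groups[normalize(raw)].append((raw, count))

# Only normalized keys reached by more than one raw spelling count as collisions.
collisions = {k: v for k, v in groups.items() if len(v) > 1}
print(collisions)
```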
|
||||
|
||||
def section_per_doc_count(rows):
|
||||
counts = Counter(len(r["frames"]) for r in rows)
|
||||
return counts
|
||||
|
||||
|
||||
def section_cooccurrence(rows, top_frames):
|
||||
top_set = set(top_frames)
|
||||
pair_counts = Counter()
|
||||
for r in rows:
|
||||
present = [f for f in r["frames"] if f in top_set]
|
||||
for i in range(len(present)):
|
||||
for j in range(i + 1, len(present)):
|
||||
a, b = sorted([present[i], present[j]])
|
||||
pair_counts[(a, b)] += 1
|
||||
return pair_counts
|
||||
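The nested index loops in `section_cooccurrence` enumerate unordered pairs; the same counting can be sketched with `itertools.combinations`. This is an alternative formulation, not the script's code, and it differs in one respect that is an assumption on my part: it deduplicates labels within a document first, so a label repeated inside one document cannot pair with itself. The function name `cooccurrence` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs, top_frames):
    """Count unordered label pairs per document, restricted to top_frames."""
    top = set(top_frames)
    pairs = Counter()
    for frames in docs:
        # Sorting makes (a, b) canonical so ("b", "a") is never counted separately.
        present = sorted({f for f in frames if f in top})
        pairs.update(combinations(present, 2))
    return pairs


docs = [["a", "b", "c"], ["b", "a"], ["c", "z"]]
pairs = cooccurrence(docs, ["a", "b", "c"])
print(pairs.most_common())
```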
|
||||
|
||||
def section_folder_crosstab(rows, top_frames):
|
||||
top_set = set(top_frames)
|
||||
table = defaultdict(Counter) # frame -> bin -> count
|
||||
bin_totals = Counter()
|
||||
for r in rows:
|
||||
b = folder_bin(r["source"])
|
||||
bin_totals[b] += 1
|
||||
for f in r["frames"]:
|
||||
if f in top_set:
|
||||
table[f][b] += 1
|
||||
return table, bin_totals
|
||||
|
||||
|
||||
def section_worker_versions(rows):
|
||||
counter = Counter(r["worker_version"] or "unknown" for r in rows)
|
||||
raw_keys_by_version = defaultdict(Counter)
|
||||
for r in rows:
|
||||
v = r["worker_version"] or "unknown"
|
||||
raw_keys_by_version[v][tuple(r["raw_keys"])] += 1
|
||||
return counter, raw_keys_by_version
|
||||
|
||||
|
||||
def section_data_gap(cur):
|
||||
"""Docs that completed Stage 2 but never had frames extracted (<2000 chars)."""
|
||||
cur.execute("""
|
||||
SELECT source, char_length
|
||||
FROM stage_2_queue
|
||||
WHERE completed_at IS NOT NULL AND char_length < 2000
|
||||
""")
|
||||
missing = cur.fetchall()
|
||||
by_bin = Counter(folder_bin(s) for s, _ in missing)
|
||||
char_lengths = [c for _, c in missing]
|
||||
return {
|
||||
"count": len(missing),
|
||||
"by_type_bin": dict(by_bin),
|
||||
"char_length": {
|
||||
"min": min(char_lengths) if char_lengths else None,
|
||||
"max": max(char_lengths) if char_lengths else None,
|
||||
"median": sorted(char_lengths)[len(char_lengths) // 2] if char_lengths else None,
|
||||
},
|
||||
"sample_sources": [s for s, _ in missing[:10]],
|
||||
}
|
||||
|
||||
|
||||
def section_corpus_coverage(cur):
|
||||
"""How much of the embeddings corpus has frame coverage?"""
|
||||
cur.execute("SELECT count(DISTINCT source) FROM embeddings")
|
||||
total = cur.fetchone()[0]
|
||||
cur.execute("""
|
||||
SELECT count(DISTINCT source) FROM embeddings
|
||||
WHERE source LIKE 'Claude:%' OR source LIKE 'ChatGPT:%'
|
||||
OR source LIKE 'Aaron AI:%' OR type='aaronai_conversation'
|
||||
""")
|
||||
conversations = cur.fetchone()[0]
|
||||
cur.execute("SELECT count(DISTINCT source) FROM stage_3_queue WHERE stage2_metadata IS NOT NULL")
|
||||
with_frames = cur.fetchone()[0]
|
||||
cur.execute("""
|
||||
SELECT count(DISTINCT source) FROM stage_2_queue
|
||||
WHERE completed_at IS NOT NULL AND char_length < 2000
|
||||
""")
|
||||
short_no_frames = cur.fetchone()[0]
|
||||
cur.execute("""
|
||||
SELECT count(DISTINCT source) FROM stage_2_queue
|
||||
WHERE failed_at IS NOT NULL
|
||||
""")
|
||||
failed = cur.fetchone()[0]
|
||||
return {
|
||||
"total_distinct_sources_in_embeddings": total,
|
||||
"conversations_no_frames_by_design": conversations,
|
||||
"files_with_frames": with_frames,
|
||||
"files_short_no_frames": short_no_frames,
|
||||
"files_stage2_failed": failed,
|
||||
"frame_coverage_pct": round(100.0 * with_frames / max(total, 1), 1),
|
||||
}
|
||||
|
||||
|
||||
def main():
|
||||
conn = psycopg2.connect(os.environ["PG_DSN"])
|
||||
cur = conn.cursor()
|
||||
|
||||
rows = fetch_rows(cur)
|
||||
n_docs = len(rows)
|
||||
print(f"=== Stage 2 frame distribution report ({n_docs} docs) ===\n")
|
||||
|
||||
# 1. Frequency
|
||||
freq = section_frequency(rows)
|
||||
print(f"--- 1. Frame frequency ({len(freq)} distinct labels) ---")
|
||||
for label, count in freq.most_common(30):
|
||||
print(f" {count:5d} {label}")
|
||||
print()
|
||||
|
||||
# 2. Hygiene
|
||||
collisions = section_hygiene(freq)
|
||||
print(f"--- 2. Label hygiene (normalized collisions: {len(collisions)}) ---")
|
||||
for norm, variants in sorted(collisions.items(), key=lambda kv: -sum(c for _, c in kv[1])):
|
||||
variant_str = ", ".join(f"{r!r}:{c}" for r, c in sorted(variants, key=lambda x: -x[1]))
|
||||
print(f" '{norm}': {variant_str}")
|
||||
print()
|
||||
|
||||
# 3. Per-doc frame count
|
||||
per_doc = section_per_doc_count(rows)
|
||||
print("--- 3. Per-doc frame count ---")
|
||||
for n in sorted(per_doc):
|
||||
print(f" {n} frames: {per_doc[n]} docs")
|
||||
print()
|
||||
|
||||
# 4. Co-occurrence (top-K)
|
||||
top_frames = [f for f, _ in freq.most_common(TOP_K)]
|
||||
pairs = section_cooccurrence(rows, top_frames)
|
||||
print(f"--- 4. Co-occurrence (top-{TOP_K} frames, top-30 pairs) ---")
|
||||
for (a, b), count in pairs.most_common(30):
|
||||
print(f" {count:4d} {a} × {b}")
|
||||
print()
|
||||
|
||||
# 5. Folder cross-tab
|
||||
crosstab, bin_totals = section_folder_crosstab(rows, top_frames)
|
||||
print(f"--- 5. Frame × folder cross-tab (top-{TOP_K} frames) ---")
|
||||
bins_sorted = [b for b, _ in bin_totals.most_common()]
|
||||
print(f" bins (with totals): " + ", ".join(f"{b}({n})" for b, n in bin_totals.most_common(10)))
|
||||
for f in top_frames:
|
||||
row_data = crosstab[f]
|
||||
if not row_data:
|
||||
continue
|
||||
cells = ", ".join(f"{b}={c}" for b, c in row_data.most_common(5))
|
||||
print(f" {f}: {cells}")
|
||||
print()
|
||||
|
||||
# 6. Worker versions
|
||||
versions, keys_by_version = section_worker_versions(rows)
|
||||
print("--- 6. Worker version split ---")
|
||||
for v, count in versions.most_common():
|
||||
print(f" v{v}: {count} docs")
|
||||
top_shapes = keys_by_version[v].most_common(3)
|
||||
for keys, kcount in top_shapes:
|
||||
print(f" {kcount} docs with keys={list(keys)}")
|
||||
print()
|
||||
|
||||
# 7. Data gap
|
||||
gap = section_data_gap(cur)
|
||||
print("--- 7. Data-gap accounting (Stage 2 docs <2000 chars; never frame-extracted) ---")
|
||||
print(f" count: {gap['count']}")
|
||||
print(f" char_length: min={gap['char_length']['min']}, median={gap['char_length']['median']}, max={gap['char_length']['max']}")
|
||||
print(f" by type bin: {gap['by_type_bin']}")
|
||||
print(f" sample sources: {gap['sample_sources']}")
|
||||
print()
|
||||
|
||||
# 8. Corpus coverage
|
||||
coverage = section_corpus_coverage(cur)
|
||||
print("--- 8. Corpus-wide frame coverage ---")
|
||||
print(f" total distinct sources in embeddings: {coverage['total_distinct_sources_in_embeddings']}")
|
||||
print(f" conversations (no frames by design): {coverage['conversations_no_frames_by_design']}")
|
||||
print(f" files with frames: {coverage['files_with_frames']}")
|
||||
print(f" files short, no frames: {coverage['files_short_no_frames']}")
|
||||
print(f" files Stage 2 failed: {coverage['files_stage2_failed']}")
|
||||
print(f" frame coverage: {coverage['frame_coverage_pct']}% of corpus")
|
||||
print()
|
||||
|
||||
# JSON sidecar
|
||||
OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
|
||||
sidecar = {
|
||||
"generated_at": datetime.now().isoformat(),
|
||||
"n_docs_with_frames": n_docs,
|
||||
"n_distinct_labels": len(freq),
|
||||
"top_30_frames": freq.most_common(30),
|
||||
"label_collisions": {
|
||||
k: [(r, c) for r, c in v] for k, v in collisions.items()
|
||||
},
|
||||
"per_doc_frame_count": dict(per_doc),
|
||||
"top_30_pairs": [
|
||||
{"a": a, "b": b, "count": c}
|
||||
for (a, b), c in pairs.most_common(30)
|
||||
],
|
||||
"folder_crosstab": {
|
||||
f: dict(crosstab[f]) for f in top_frames if crosstab[f]
|
||||
},
|
||||
"bin_totals": dict(bin_totals),
|
||||
"worker_versions": dict(versions),
|
||||
"data_gap": gap,
|
||||
"corpus_coverage": coverage,
|
||||
}
|
||||
OUT_PATH.write_text(json.dumps(sidecar, indent=2, default=str))
|
||||
print(f"JSON sidecar written: {OUT_PATH}")
|
||||
|
||||
cur.close()
|
||||
conn.close()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -75,6 +75,17 @@ async def lifespan(app: FastAPI):
         max_coroutines=2,
     )
     await graphiti_instance.build_indices_and_constraints()
+    # Bridge driver._search_ops to driver.search_interface — graphiti-core 0.29.0
+    # builds FalkorSearchOperations as driver._search_ops in FalkorDriver.__init__
+    # but never assigns it to driver.search_interface. search_utils.py dispatches
+    # on driver.search_interface; without this assignment it falls back to
+    # interpreted-Cypher cosine math (full table scans). Together with the
+    # vendored patches in graphiti_patches/, this activates FalkorDB's native
+    # vector index for entity dedup similarity search.
+    if (hasattr(graphiti_instance.driver, "_search_ops")
+            and graphiti_instance.driver.search_interface is None):
+        graphiti_instance.driver.search_interface = graphiti_instance.driver._search_ops
+        log.info("Wired driver.search_interface = driver._search_ops (vector index path active)")
     log.info(f"Graphiti ready — provider: {LLM_PROVIDER}, group: {GROUP_ID}")
     yield
     await graphiti_instance.close()
|
||||
|
||||
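The bridging step in the hunk above can be illustrated in isolation. The following is a minimal sketch using stand-in classes, not graphiti-core itself: `FakeDriver`, `FakeSearchOps`, and `wire_search_interface` are hypothetical names introduced only for illustration, and the sketch assumes nothing about the real driver beyond the two attributes named in the comment.

```python
class FakeSearchOps:
    """Stand-in for the native search operations object (hypothetical)."""


class FakeDriver:
    """Hypothetical driver that builds _search_ops but leaves search_interface unset."""

    def __init__(self):
        self._search_ops = FakeSearchOps()
        self.search_interface = None


def wire_search_interface(driver) -> bool:
    """Point search_interface at _search_ops when it is unset.

    Returns True if the assignment was made, False if nothing changed.
    """
    if hasattr(driver, "_search_ops") and getattr(driver, "search_interface", None) is None:
        driver.search_interface = driver._search_ops
        return True
    return False


driver = FakeDriver()
wired_first = wire_search_interface(driver)   # performs the assignment
wired_again = wire_search_interface(driver)   # no-op: already wired
```

Guarding on both `hasattr` and `is None` keeps the step safe to repeat and leaves any interface that a later library version assigns itself untouched.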
+25
-6
@@ -15,7 +15,7 @@ from dotenv import load_dotenv
 import psycopg2
 from sentence_transformers import SentenceTransformer

-from encoding import extract_text, chunk_and_embed, write_embeddings_batch, SUPPORTED
+from encoding import extract_blocks, chunk_and_embed, write_embeddings_batch, SUPPORTED
 from failures import (
     record_ingest_failure as _record_failure_sql,
     resolve_ingest_failure as _resolve_failure_sql,
@@ -77,14 +77,29 @@ def _resolve_failure(source: str) -> None:
         print(f" Could not resolve ingest failure record (non-fatal): {e}")


+IGNORED_TOP_FOLDERS = {"Drafts"}
+
+
 def _ingest_one(filepath: Path, embedder, root: Path = None) -> int:
     """Ingest a single file. Returns chunk count, 0 on skip/failure."""
-    if filepath.name.startswith(("~$", ".")):
+    # "~" catches Office lock files (~$) including the case where Nextcloud
+    # filesystem encoding has mangled the "$" to a unicode replacement char.
+    if filepath.name.startswith(("~", ".")):
         return 0
     if filepath.suffix.lower() not in SUPPORTED:
         return 0
-    text = extract_text(filepath)
-    if not text.strip():
+    if root is not None:
+        try:
+            rel = filepath.parent.relative_to(root)
+            if rel.parts and rel.parts[0] in IGNORED_TOP_FOLDERS:
+                return 0
+        except ValueError:
+            pass
+    blocks = extract_blocks(filepath)
+    if not blocks or not any(
+        (b.get("text") or "").strip() or (b.get("heading") or "").strip()
+        for b in blocks
+    ):
         _record_failure(filepath, "Text extraction failed or empty")
         return 0
     folder_rel = None
@@ -94,7 +109,7 @@ def _ingest_one(filepath: Path, embedder, root: Path = None) -> int:
         except ValueError:
             pass
     try:
-        rows = chunk_and_embed(text, filepath.name, embedder,
+        rows = chunk_and_embed(blocks, filepath.name, embedder,
                                filepath=filepath, folder=folder_rel)
     except Exception as e:
         _record_failure(filepath, f"Embedding failed: {e}")
@@ -113,7 +128,11 @@ def _ingest_one(filepath: Path, embedder, root: Path = None) -> int:
     print(f" Indexed {len(rows)} chunks: {filepath.name}")
     _resolve_failure(filepath.name)
     if not os.getenv("SKIP_STAGE2_ENQUEUE"):
-        enqueue_stage2(filepath.name, text)
+        full_text = "\n".join(
+            f"{b['heading']}\n{b['text']}" if b.get("heading") else b.get("text", "")
+            for b in blocks
+        )
+        enqueue_stage2(filepath.name, full_text)
     return len(rows)


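The block-to-text flattening that feeds `enqueue_stage2` in the hunk above can be tried on its own. This is a small sketch under the assumption, taken from the diff, that each block is a dict with optional `heading` and `text` keys; `blocks_to_text` and `has_content` are hypothetical helper names and the sample blocks are invented.

```python
def has_content(blocks) -> bool:
    """True if any block carries non-blank text or a non-blank heading."""
    return bool(blocks) and any(
        (b.get("text") or "").strip() or (b.get("heading") or "").strip()
        for b in blocks
    )


def blocks_to_text(blocks) -> str:
    """Join blocks into one string, putting each heading on the line above its text."""
    return "\n".join(
        f"{b['heading']}\n{b['text']}" if b.get("heading") else b.get("text", "")
        for b in blocks
    )


sample = [
    {"heading": "Education", "text": "MFA, 2010"},
    {"text": "Loose paragraph"},
    {"heading": "", "text": "No heading"},
]
full_text = blocks_to_text(sample)
empty_ok = has_content([{"heading": " ", "text": None}])
```

One thing the sketch makes visible: a block with a heading but no `text` key would raise `KeyError` in the join, so the expression relies on the extractor always emitting `text` alongside a heading.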
@@ -18,8 +18,14 @@ CONVERSATIONS_DB = str(Path.home() / "aaronai" / "conversations.db")
 PG_DSN = os.getenv("PG_DSN")
 MIN_EXCHANGES = 3

+_embedder = None
+
+def get_embedder():
+    global _embedder
+    if _embedder is None:
         print("Loading embedding model...")
-embedder = SentenceTransformer("all-MiniLM-L6-v2")
+        _embedder = SentenceTransformer("all-MiniLM-L6-v2")
+    return _embedder

 def get_conversations():
     conn = sqlite3.connect(CONVERSATIONS_DB)
@@ -123,9 +129,18 @@ def run():

     # Embed and insert
     texts = [c[1] for c in new_chunks]
-    embeddings = embedder.encode(texts, show_progress_bar=False).tolist()
+    embeddings = get_embedder().encode(texts, show_progress_bar=False).tolist()

     for (chunk_id, chunk_text, meta), embedding in zip(new_chunks, embeddings):
+        if not meta.get("type"):
+            raise ValueError(
+                f"chunk {chunk_id!r} missing 'type'; writers must supply it "
+                f"(see Improvement #2 in docs/birdai-component-inventory)"
+            )
+        # ON CONFLICT below intentionally overwrites created_at (unlike encoding.py's
+        # COALESCE): an Aaron-AI conversation's created_at tracks convo.updated_at,
+        # which advances on activity. Re-running this script on an active conv
+        # should refresh the timestamp, not preserve the first-seen one.
         cur.execute("""
             INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
             VALUES (%s, %s, %s::vector, %s, %s, %s, %s)
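The first hunk of this file replaces a module-level model load with a lazy accessor, so importing the module no longer pays the load cost. A minimal sketch of that pattern follows, with a cheap stand-in for the model: `expensive_load`, `get_model`, and `load_count` are hypothetical names, and no real embedding library is involved.

```python
_model = None      # module-level cache, filled on first use
load_count = 0     # instrumentation for the sketch only


def expensive_load():
    """Stand-in for a slow constructor such as loading an embedding model."""
    global load_count
    load_count += 1
    return object()


def get_model():
    """Return the cached instance, creating it on the first call."""
    global _model
    if _model is None:
        _model = expensive_load()
    return _model


first = get_model()
second = get_model()   # reuses the cached instance, no second load
```

This form is not thread-safe: two threads calling `get_model()` at the same moment could both load. For a single-threaded batch script that is likely acceptable.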
@@ -0,0 +1,136 @@
+"""
+Orientation Indexer — feeds Stage 2's document-level orientations into pgvector
+so they're searchable alongside chunk text by the retrieve_documents tool.
+
+Each completed row in stage_3_queue has an `orientation` string (active_frames
++ frame_relationships + extraction_orientation + one_sentence_summary) that
+describes the document at a conceptual level. Indexing it as its own row in
+the embeddings table gives the cross-encoder a second surface to rank against
+— "what is this document about" rather than just "what does this chunk say."
+
+This worker is part of the "read-only Graphiti + orientation-into-pgvector"
+plan B that replaced the Stage 3 → Graphiti write path. The graph layer is
+queried directly via the search_facts chat tool; orientations land here.
+
+State tracking: a row is considered indexed if the embeddings table already
+holds a row with source=<source> and metadata->>'kind'='orientation'. The
+worker is idempotent — restart-safe, resumable.
+
+Runs as systemd: aaronai-orientation-indexer.service
+"""
+
+import logging
+import os
+import sys
+import time
+from pathlib import Path
+
+from dotenv import load_dotenv
+import psycopg2
+from sentence_transformers import SentenceTransformer
+
+load_dotenv(Path.home() / "aaronai" / ".env", override=True)
+
+sys.path.insert(0, str(Path(__file__).parent))
+from encoding import write_embeddings_batch
+
+PG_DSN = os.getenv("PG_DSN")
+EMBED_MODEL = "all-MiniLM-L6-v2"
+BATCH_SIZE = 25
+POLL_INTERVAL_SECS = 30
+LOG_FILE = "/var/log/aaronai/orientation-indexer.log"
+HEARTBEAT_FILE = "/var/log/aaronai/orientation-indexer-heartbeat"
+
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s [orientation-indexer] %(levelname)s %(message)s",
+    handlers=[logging.FileHandler(LOG_FILE, mode="a")],
+)
+log = logging.getLogger("orientation-indexer")
+
+
+def get_pg():
+    return psycopg2.connect(PG_DSN)
+
+
+def fetch_unindexed(cur, limit):
+    """Pull stage_3_queue rows with a non-null orientation whose orientation
+    hasn't been written to the embeddings table yet."""
+    cur.execute(
+        """
+        SELECT s.source, s.orientation
+        FROM stage_3_queue s
+        WHERE s.orientation IS NOT NULL
+        AND NOT EXISTS (
+            SELECT 1 FROM embeddings e
+            WHERE e.source = s.source
+            AND e.metadata->>'kind' = 'orientation'
+        )
+        ORDER BY s.enqueued_at
+        LIMIT %s
+        """,
+        (limit,),
+    )
+    return cur.fetchall()
+
+
+def _row_for(source: str, orientation: str, embedding) -> dict:
+    """Build an embeddings row for the orientation. id is deterministic so
+    re-runs don't create duplicates if the unique check above ever races."""
+    import hashlib
+    chunk_id = hashlib.md5(f"orientation:{source}".encode()).hexdigest()[:8] + "_orient"
+    return {
+        "id": chunk_id,
+        "document": orientation,
+        "embedding": embedding,
+        "source": source,
+        "type": "document",
+        "metadata": {
+            "source": source,
+            "kind": "orientation",
+        },
+    }
+
+
+def write_heartbeat():
+    try:
+        Path(HEARTBEAT_FILE).write_text(str(time.time()))
+    except Exception:
+        pass
+
+
+def main():
+    log.info("Orientation indexer starting...")
+    log.info(f"Loading embedding model: {EMBED_MODEL}")
+    embedder = SentenceTransformer(EMBED_MODEL)
+    log.info("Embedding model ready.")
+
+    while True:
+        write_heartbeat()
+        try:
+            pg = get_pg()
+            try:
+                cur = pg.cursor()
+                rows = fetch_unindexed(cur, BATCH_SIZE)
+                if not rows:
+                    pg.close()
+                    time.sleep(POLL_INTERVAL_SECS)
+                    continue
+
+                orientations = [r[1] for r in rows]
+                embeddings = embedder.encode(orientations).tolist()
+                batch = [
+                    _row_for(source, orient, emb)
+                    for (source, orient), emb in zip(rows, embeddings)
+                ]
+                write_embeddings_batch(pg, batch)
+                log.info(f"Indexed {len(batch)} orientation(s)")
+            finally:
+                pg.close()
+        except Exception as e:
+            log.error(f"Indexing loop iteration failed: {e}")
+            time.sleep(POLL_INTERVAL_SECS)
+
+
+if __name__ == "__main__":
+    main()
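The deterministic row id used by `_row_for` above is what makes a repeated write land on the same key instead of creating a second row. A minimal sketch of just that derivation follows; `orientation_id` is a hypothetical wrapper name and the file names are invented examples.

```python
import hashlib


def orientation_id(source: str) -> str:
    """Derive a stable id from the source name: 8 hex chars of MD5 plus a suffix."""
    digest = hashlib.md5(f"orientation:{source}".encode()).hexdigest()
    return digest[:8] + "_orient"


id_a = orientation_id("cv.docx")
id_b = orientation_id("cv.docx")         # same input, same id
id_c = orientation_id("syllabus.docx")   # different input
```

Because only 8 hex characters (32 bits) of the digest are kept, two different sources could in principle map to the same id; whether that matters depends on corpus size and on how the embeddings table handles id conflicts, neither of which is shown in this diff.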
@@ -0,0 +1,146 @@
+"""One-off: re-ingest docx+pptx after the 2026-05-04 extractor upgrade (commit 93c0d89).
+
+Pre-upgrade extraction missed tables, headers/footers, text boxes, group shapes,
+and pptx notes — leaving CVs/dossiers as section-header skeletons in the index.
+
+Steps when run with --apply:
+  1. DELETE all embeddings rows where source ends in .docx or .pptx
+  2. Walk NEXTCLOUD_PATH and re-ingest every .docx/.pptx via _ingest_one
+  3. Stage 2 enqueue is suppressed (SKIP_STAGE2_ENQUEUE=1)
+
+Without --apply: dry-run. Counts files and chunks, prints a sample, writes nothing.
+"""
+
+import os
+import re
+import sys
+import time
+from pathlib import Path
+
+os.environ["SKIP_STAGE2_ENQUEUE"] = "1"
+
+from dotenv import load_dotenv
+load_dotenv(Path.home() / "aaronai" / ".env", override=True)
+
+import psycopg2
+from sentence_transformers import SentenceTransformer
+
+sys.path.insert(0, str(Path(__file__).parent))
+from ingest import _ingest_one, get_pg
+
+NEXTCLOUD_PATH = Path("/home/aaron/nextcloud/data/data/aaron/files")
+
+APPLY = "--apply" in sys.argv
+_ext_args = [a for a in sys.argv[1:] if a.startswith("--ext=")]
+if _ext_args:
+    TARGET_EXTS = {("." + e.lstrip(".")) for arg in _ext_args
+                   for e in arg.split("=", 1)[1].split(",")}
+else:
+    TARGET_EXTS = {".docx", ".pptx"}
+
+
+def _ext_regex():
+    inner = "|".join(re.escape(e.lstrip(".")) for e in sorted(TARGET_EXTS))
+    return f"\\.({inner})$"
+
+
+def count_stale():
+    pg = get_pg()
+    cur = pg.cursor()
+    cur.execute(
+        f"SELECT lower(substring(source from '\\.[^.]+$')) AS ext, "
+        f"COUNT(DISTINCT source) AS files, COUNT(*) AS chunks "
+        f"FROM embeddings WHERE lower(source) ~ '{_ext_regex()}' "
+        f"GROUP BY 1 ORDER BY 1"
+    )
+    rows = cur.fetchall()
+    pg.close()
+    return rows
+
+
+def delete_stale():
+    pg = get_pg()
+    cur = pg.cursor()
+    cur.execute(f"DELETE FROM embeddings WHERE lower(source) ~ '{_ext_regex()}'")
+    deleted = cur.rowcount
+    pg.commit()
+    pg.close()
+    return deleted
+
+
+def find_files():
+    files = []
+    for f in NEXTCLOUD_PATH.rglob("*"):
+        if not f.is_file():
+            continue
+        if f.suffix.lower() not in TARGET_EXTS:
+            continue
+        if f.name.startswith(("~$", ".")):
+            continue
+        files.append(f)
+    return files
+
+
+def main():
+    print(f"Mode: {'APPLY (destructive)' if APPLY else 'DRY-RUN (no writes)'}")
+    print(f"Target: {NEXTCLOUD_PATH}")
+    print(f"Extensions: {sorted(TARGET_EXTS)}")
+    print(f"SKIP_STAGE2_ENQUEUE={os.environ.get('SKIP_STAGE2_ENQUEUE')}")
+    print()
+
+    print("Stale chunks currently in DB:")
+    for ext, files, chunks in count_stale():
+        print(f" {ext}: {files} files, {chunks} chunks")
+    print()
+
+    files = find_files()
+    by_ext = {}
+    for f in files:
+        by_ext.setdefault(f.suffix.lower(), []).append(f)
+    print(f"Files on disk to re-ingest:")
+    for ext, lst in sorted(by_ext.items()):
+        print(f" {ext}: {len(lst)} files")
+    print(f" total: {len(files)}")
+    print()
+    print("Sample (5 random):")
+    import random
+    for f in random.sample(files, min(5, len(files))):
+        print(f" {f}")
+    print()
+
+    if not APPLY:
+        print("Dry-run only. Re-run with --apply to delete + re-ingest.")
+        return
+
+    print("Deleting stale chunks...")
+    n = delete_stale()
+    print(f" deleted {n} rows")
+    print()
+
+    print("Loading embedder...")
+    embedder = SentenceTransformer("all-MiniLM-L6-v2")
+    print()
+
+    print(f"Re-ingesting {len(files)} files...")
+    started = time.time()
+    ingested = failed = total_chunks = 0
+    for i, f in enumerate(files, 1):
+        n = _ingest_one(f, embedder, root=NEXTCLOUD_PATH)
+        if n > 0:
+            ingested += 1
+            total_chunks += n
+        else:
+            failed += 1
+        if i % 25 == 0 or i == len(files):
+            elapsed = time.time() - started
+            rate = i / elapsed if elapsed else 0
+            print(f" [{i}/{len(files)}] ingested={ingested} failed={failed} "
+                  f"chunks={total_chunks} ({rate:.1f} files/s)")
+    elapsed = time.time() - started
+    print()
+    print(f"Done in {elapsed:.0f}s: {ingested} ingested, {failed} failed, "
+          f"{total_chunks} chunks written.")
+
+
+if __name__ == "__main__":
+    main()
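The `--ext=` parsing and the regex it feeds into the SQL can be checked without a database. The sketch below lifts the two expressions from the script into standalone functions; `parse_exts` and `ext_regex` are hypothetical names, and the pattern is exercised with Python's `re` rather than PostgreSQL's `~` operator, on the assumption that the two agree for a pattern this simple.

```python
import re


def parse_exts(argv, default=(".docx", ".pptx")):
    """Collect extensions from --ext=a,b arguments, normalising to a leading dot."""
    ext_args = [a for a in argv if a.startswith("--ext=")]
    if not ext_args:
        return set(default)
    return {
        "." + e.lstrip(".")
        for arg in ext_args
        for e in arg.split("=", 1)[1].split(",")
    }


def ext_regex(exts):
    """Build an anchored alternation such as \\.(docx|pptx)$ from a set of extensions."""
    inner = "|".join(re.escape(e.lstrip(".")) for e in sorted(exts))
    return f"\\.({inner})$"


exts = parse_exts(["--apply", "--ext=docx,.pdf"])
pattern = ext_regex({".docx", ".pptx"})
```

`re.escape` and the sort keep the pattern deterministic and free of regex metacharacters, but the script still interpolates it into SQL as a string literal, so the extension list is best treated as trusted operator input only.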
@@ -0,0 +1,123 @@
+"""One-off: remove embeddings rows that no longer correspond to a file on disk.
+
+Two passes:
+  1. Modern rows (metadata.filepath set): check each filepath, delete if missing.
+  2. Legacy rows (metadata.filepath null): build a set of all basenames present
+     anywhere under NEXTCLOUD_PATH, then delete rows whose `source` basename
+     isn't in that set.
+
+Default mode is a dry-run (counts + sample paths, no writes). Pass --apply to
+actually delete.
+"""
+
+import os
+import sys
+from pathlib import Path
+from collections import defaultdict
+
+from dotenv import load_dotenv
+load_dotenv(Path.home() / "aaronai" / ".env", override=True)
+
+import psycopg2
+
+NEXTCLOUD_PATH = Path("/home/aaron/nextcloud/data/data/aaron/files")
+APPLY = "--apply" in sys.argv
+
+
+def get_pg():
+    return psycopg2.connect(os.environ["PG_DSN"])
+
+
+def scan_modern_orphans():
+    """Rows with metadata.filepath whose file doesn't exist on disk."""
+    pg = get_pg()
+    cur = pg.cursor()
+    cur.execute(
+        "SELECT id, source, metadata->>'filepath' AS filepath "
+        "FROM embeddings WHERE metadata->>'filepath' IS NOT NULL"
+    )
+    orphans = []
+    by_source = defaultdict(int)
+    for row in cur.fetchall():
+        fp = row[2]
+        if fp and not Path(fp).exists():
+            orphans.append(row)
+            by_source[row[1]] += 1
+    pg.close()
+    return orphans, by_source
+
+
+def scan_legacy_orphans():
+    """Rows without metadata.filepath whose basename isn't anywhere under
+    NEXTCLOUD_PATH. Restricted to type='document' so conversations and memory
+    snapshots (which are synthetic sources, not files on disk) aren't flagged
+    as orphans. Walks the filesystem once to build the basename set."""
+    print(f" walking {NEXTCLOUD_PATH} to build basename index...")
+    on_disk = set()
+    for p in NEXTCLOUD_PATH.rglob("*"):
+        if p.is_file():
+            on_disk.add(p.name)
+    print(f" {len(on_disk):,} files on disk")
+
+    pg = get_pg()
+    cur = pg.cursor()
+    cur.execute(
+        "SELECT id, source FROM embeddings "
+        "WHERE metadata->>'filepath' IS NULL AND type = 'document'"
+    )
+    orphans = []
+    by_source = defaultdict(int)
+    for row in cur.fetchall():
+        if row[1] not in on_disk:
+            orphans.append(row)
+            by_source[row[1]] += 1
+    pg.close()
+    return orphans, by_source
+
+
+def delete_rows(ids):
+    pg = get_pg()
+    cur = pg.cursor()
+    cur.execute("DELETE FROM embeddings WHERE id = ANY(%s)", (list(ids),))
+    deleted = cur.rowcount
+    pg.commit()
+    pg.close()
+    return deleted
+
+
+def main():
+    print(f"Mode: {'APPLY (destructive)' if APPLY else 'DRY-RUN (no writes)'}")
+    print(f"Target: {NEXTCLOUD_PATH}")
+    print()
+
+    print("Pass 1 — modern rows (metadata.filepath set):")
+    modern, modern_by_src = scan_modern_orphans()
+    print(f" {len(modern):,} orphan rows across {len(modern_by_src):,} files")
+    for src, n in sorted(modern_by_src.items(), key=lambda kv: -kv[1])[:10]:
+        print(f" {n:>4} chunks — {src}")
+    print()
+
+    print("Pass 2 — legacy rows (no metadata.filepath):")
+    legacy, legacy_by_src = scan_legacy_orphans()
+    print(f" {len(legacy):,} orphan rows across {len(legacy_by_src):,} files")
+    for src, n in sorted(legacy_by_src.items(), key=lambda kv: -kv[1])[:10]:
+        print(f" {n:>4} chunks — {src}")
+    print()
+
+    total = len(modern) + len(legacy)
+    if total == 0:
+        print("Nothing to delete.")
+        return
+
+    if not APPLY:
+        print(f"Dry-run only. Re-run with --apply to delete {total:,} rows.")
+        return
+
+    print(f"Deleting {total:,} orphan rows...")
+    n1 = delete_rows([r[0] for r in modern]) if modern else 0
+    n2 = delete_rows([r[0] for r in legacy]) if legacy else 0
+    print(f" modern: {n1:,} legacy: {n2:,} total: {n1 + n2:,}")
+
+
+if __name__ == "__main__":
+    main()
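The legacy pass above decides orphan status by basename alone. A small sketch of that check against a throwaway directory shows both what it catches and what it cannot; `find_legacy_orphans` is a hypothetical helper, and the folder and file names are invented.

```python
import tempfile
from pathlib import Path


def find_legacy_orphans(root: Path, sources):
    """Return the sources whose basename does not appear anywhere under root."""
    on_disk = {p.name for p in root.rglob("*") if p.is_file()}
    return [s for s in sources if s not in on_disk]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Teaching").mkdir()
    (root / "Teaching" / "syllabus.docx").write_text("placeholder")
    # "syllabus.docx" exists somewhere under root; "deleted.pptx" does not.
    orphans = find_legacy_orphans(root, ["syllabus.docx", "deleted.pptx"])
```

The trade-off is that a deleted file is not detected as an orphan while any other file with the same name still exists elsewhere in the tree, which is presumably why the modern pass matches on the full `metadata.filepath` instead.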
@@ -0,0 +1,53 @@
+"""End-to-end test of retrieve_context with intent routing + reranking.
+
+Avoids loading the full FastAPI app; replicates the chat-handler retrieval
+call shape and prints classifier output + final ranked sources for each query.
+"""
+
+import os
+import sys
+from pathlib import Path
+
+from dotenv import load_dotenv
+load_dotenv(Path.home() / "aaronai" / ".env", override=True)
+
+sys.path.insert(0, str(Path(__file__).parent))
+
+# Stub anthropic so api.py import doesn't fail without the SDK loaded.
+# We only need retrieve_context.
+import types
+sys.modules.setdefault("anthropic", types.ModuleType("anthropic"))
+sys.modules["anthropic"].Anthropic = lambda **kw: None
+
+# Same for whisper if present
+if "faster_whisper" not in sys.modules:
+    sys.modules["faster_whisper"] = types.ModuleType("faster_whisper")
+
+import importlib.util
+spec = importlib.util.spec_from_file_location("api", Path(__file__).parent / "api.py")
+api = importlib.util.module_from_spec(spec)
+# Don't execute the whole module (it starts FastAPI). Instead, exec only definitions.
+# Easier: just import the functions we need by exec'ing the file but catching errors.
+try:
+    spec.loader.exec_module(api)
+except Exception as e:
+    print(f"(continuing despite api.py side-effect error: {e})")
+
+retrieve_context = api.retrieve_context
+
+QUERIES = [
+    "write me a bio",
+    "my professional bio",
+    "Aaron Nelson CV consulting and design work",
+    "FWN3D consulting",
+    "syllabi I have taught",
+    "philosophy of teaching",
+    "Hudson Valley Additive Manufacturing Center",
+    "Aaron Nelson is an artist and educator working in additive manufacturing",
+]
+
+for q in QUERIES:
+    pieces, sources = retrieve_context(q)
+    print(f"\n=== {q!r} ===")
+    for i, src in enumerate(sources, 1):
+        print(f" {i}. {src}")
+116
-9
@@ -29,7 +29,7 @@ from sentence_transformers import SentenceTransformer
 from watchdog.observers import Observer
 from watchdog.events import FileSystemEventHandler

-from encoding import extract_text, chunk_and_embed, write_embeddings_batch, SUPPORTED
+from encoding import extract_blocks, chunk_and_embed, write_embeddings_batch, SUPPORTED
 from failures import (
     record_ingest_failure as _record_failure_sql,
     resolve_ingest_failure as _resolve_failure_sql,
@@ -123,13 +123,61 @@ def resolve_ingest_failure(source: str):
         log.warning(f"Could not resolve ingest failure record (non-fatal): {e}")


+def delete_embeddings_for_path(filepath: Path):
+    """Remove embeddings rows for a file that no longer exists. Matches by
+    metadata.filepath so multi-folder same-basename files don't collide.
+    Legacy rows without filepath metadata are left alone — they get cleaned
+    by sweep_orphans.py."""
+    try:
+        pg = get_pg()
+        try:
+            cur = pg.cursor()
+            cur.execute(
+                "DELETE FROM embeddings WHERE metadata->>'filepath' = %s",
+                (str(filepath),),
+            )
+            deleted = cur.rowcount
+            pg.commit()
+            if deleted:
+                log.info(f"Deleted {deleted} chunks for removed file: {filepath}")
+        finally:
+            pg.close()
+    except Exception as e:
+        log.warning(f"Could not delete embeddings for {filepath} (non-fatal): {e}")
+
+
+def remove_from_state(filepath: Path):
+    """Drop a deleted file from watcher_state.json so it isn't carried as
+    'known mtime' indefinitely."""
+    try:
+        state = load_state()
+        key = str(filepath)
+        if key in state:
+            del state[key]
+            save_state(state)
+    except Exception as e:
+        log.warning(f"Could not update state for deleted {filepath} (non-fatal): {e}")
+
+
+IGNORED_TOP_FOLDERS = {"Drafts"}
+
+
 def ingest_file(filepath: Path, embedder) -> int:
-    if filepath.name.startswith(("~$", ".")):
+    if filepath.name.startswith(("~$", "~", ".")):
         return 0
     if filepath.suffix.lower() not in SUPPORTED:
         return 0
-    text = extract_text(filepath)
-    if not text.strip():
+    try:
+        rel = filepath.parent.relative_to(NEXTCLOUD_PATH)
+        if rel.parts and rel.parts[0] in IGNORED_TOP_FOLDERS:
+            return 0
+    except ValueError:
+        pass
+    blocks = extract_blocks(filepath)
+    if not blocks or not any(
+        (b.get("text") or "").strip() or (b.get("heading") or "").strip()
+        for b in blocks
+    ):
         record_ingest_failure(filepath, "Text extraction failed or empty")
         return 0
     folder_rel = None
@@ -138,7 +186,7 @@ def ingest_file(filepath: Path, embedder) -> int:
     except ValueError:
         pass
     try:
-        rows = chunk_and_embed(text, filepath.name, embedder,
+        rows = chunk_and_embed(blocks, filepath.name, embedder,
                                filepath=filepath, folder=folder_rel)
     except Exception as e:
         log.error(f"Embedding failed for {filepath.name}: {e}")
@@ -159,7 +207,11 @@ def ingest_file(filepath: Path, embedder) -> int:
         return 0
     log.info(f"Indexed {len(rows)} chunks: {filepath.name}")
     resolve_ingest_failure(source)
-    enqueue_stage2(source, text)
+    full_text = "\n".join(
+        f"{b['heading']}\n{b['text']}" if b.get("heading") else b.get("text", "")
+        for b in blocks
+    )
+    enqueue_stage2(source, full_text)
     return len(rows)


@@ -168,6 +220,7 @@ def ingest_files(paths: list, embedder, state: dict) -> dict:
     for path in paths:
         count = ingest_file(path, embedder)
         total += count
+        if count > 0:
             state[str(path)] = str(path.stat().st_mtime)
     log.info(f"Ingestion complete. {total} chunks across {len(paths)} files.")
     return state
@@ -196,12 +249,24 @@ def get_changed_files(state: dict) -> list:
             continue
         if path.suffix.lower() not in SUPPORTED:
             continue
-        if path.name.startswith((".", "~$")):
+        if path.name.startswith((".", "~$", "~")):
             continue
         if "Admin/Backups" in str(path) or "Backups" in path.parts:
             continue
+        if "Journal/Media" in str(path):
+            continue
+        if "Generative Design" in path.parts and "Processing" in path.parts:
+            continue
+        if "Computational Design 2017" in path.parts and "Student Work" in path.parts:
+            continue
+        if path.name in ("Renders.pptx", "Ribbon Cutting Slideshow.pptx") \
+                and "Presentations" in path.parts:
+            continue
+        if path.name == "GH Slicer Notes [Autosaved].pptx" \
+                and "DDF555 3D Computational" in path.parts:
+            continue
         if path.stat().st_size == 0:
             continue
         if state.get(str(path)) != str(path.stat().st_mtime):
             changed.append(path)
     return changed
@@ -280,12 +345,22 @@ class IngestHandler(FileSystemEventHandler):
         self.last_event = 0

     def _should_ignore(self, path: Path) -> bool:
-        if path.name.startswith((".", "~$")):
+        if path.name.startswith((".", "~$", "~")):
             return True
         if "Admin/Backups" in str(path) or "Backups" in path.parts:
             return True
         if "Journal/Media" in str(path):
             return True
+        if "Generative Design" in path.parts and "Processing" in path.parts:
+            return True
+        if "Computational Design 2017" in path.parts and "Student Work" in path.parts:
+            return True
+        if path.name in ("Renders.pptx", "Ribbon Cutting Slideshow.pptx") \
+                and "Presentations" in path.parts:
+            return True
+        if path.name == "GH Slicer Notes [Autosaved].pptx" \
+                and "DDF555 3D Computational" in path.parts:
+            return True
         return False

     def on_created(self, event):
@@ -311,15 +386,47 @@ class IngestHandler(FileSystemEventHandler):
     def on_moved(self, event):
         if event.is_directory:
             return
+        src = Path(event.src_path)
+        dest = Path(event.dest_path)
+        # If destination is outside NEXTCLOUD_PATH (e.g., Nextcloud trashbin at
+        # /home/aaron/nextcloud/data/data/aaron/files_trashbin/), treat as a
+        # delete — the file is no longer in the watched corpus.
+        try:
+            dest.relative_to(NEXTCLOUD_PATH)
+        except ValueError:
+            if src.suffix.lower() in SUPPORTED:
+                log.info(f"Event: moved out of tree {src} -> {dest}")
+                threading.Thread(
+                    target=lambda: (
+                        delete_embeddings_for_path(src),
+                        remove_from_state(src),
+                    ),
+                    daemon=True,
+                ).start()
+            return
         # Nextcloud WebDAV writes .part temp files then renames to final path.
         # src_path is the .part file; dest_path is the final filename.
-        dest = Path(event.dest_path)
         if dest.suffix.lower() not in SUPPORTED or self._should_ignore(dest):
             return
         log.info(f"Event: moved -> {dest}")
         self.pending = True
         self.last_event = time.time()

+    def on_deleted(self, event):
+        if event.is_directory:
+            return
+        path = Path(event.src_path)
+        if path.suffix.lower() not in SUPPORTED:
+            return
+        log.info(f"Event: deleted {path}")
+        threading.Thread(
+            target=lambda: (
+                delete_embeddings_for_path(path),
+                remove_from_state(path),
+            ),
+            daemon=True,
+        ).start()
+
     def on_closed(self, event):
         # FileClosedEvent fires on the final file after Nextcloud completes write.
         # Belt-and-suspenders catch for any write pattern not caught by on_moved.
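The `on_moved` change above treats a move whose destination lies outside the watched root as a delete, using `Path.relative_to` raising `ValueError` as the containment test. A minimal sketch of that test follows; `is_inside` is a hypothetical helper and the paths are invented examples modelled on the trashbin case in the comment.

```python
from pathlib import PurePosixPath


def is_inside(root: PurePosixPath, path: PurePosixPath) -> bool:
    """True if path is root itself or lies beneath it, compared component by component."""
    try:
        path.relative_to(root)
        return True
    except ValueError:
        return False


root = PurePosixPath("/data/aaron/files")
kept = is_inside(root, PurePosixPath("/data/aaron/files/Teaching/syllabus.docx"))
trashed = is_inside(root, PurePosixPath("/data/aaron/files_trashbin/syllabus.docx"))
```

Because `relative_to` compares whole path components, a sibling directory such as `files_trashbin` is not mistaken for a child of `files`, which a plain string-prefix check would get wrong. The comparison is purely lexical, so symlinks and `..` segments are not resolved.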