Commit Graph

102 Commits

Author SHA1 Message Date
aaron 0a1e2b4f61 api.py: preview-then-commit flow for save_document
The previous system prompt instructed Claude to skip duplicating document
content in chat and write the file directly. That produced no-preview UX:
the user asked for a bio and the docx appeared in Drafts/ before they had
a chance to read or refine it. Reversed: Claude now drafts in chat first,
waits for an explicit save signal, and only then calls save_document. The
explicit "skip preview" escape hatch is preserved for one-shot flows.
2026-05-20 01:01:45 +00:00
aaron 8c2c597687 api.py: save_document — distinguish PATH miss from missing install in error
The systemd unit pins PATH to the venv only, so subprocess.run(['pandoc', ...])
raised FileNotFoundError even though pandoc was installed at /usr/bin/pandoc.
The handler's "pandoc not installed" message was misleading — pandoc was
reachable from a login shell but not from the service. Rephrased to point at
the actual cause: the service's PATH. The systemd drop-in to extend PATH is
not committed here (lives at /etc/systemd/system/aaronai.service.d/path.conf
on the host).
2026-05-20 00:51:41 +00:00
aaron fda61ad622 api.py: save_document tool — pandoc render to Nextcloud Drafts/ via WebDAV
Claude can now write docx or pdf files to Aaron's Nextcloud Drafts/ when he
asks for a document (bio, cover letter, statement, CV section) rather than
chat text. Pandoc handles markdown -> docx and markdown -> pdf with the
xelatex engine. Upload is a WebDAV PUT against the same Nextcloud instance
dream.py already uses; NEXTCLOUD_URL / NEXTCLOUD_USER / NEXTCLOUD_PASSWORD
in .env are reused. MKCOL ensures Drafts/ exists; PROPFIND-based collision
check appends _2, _3, ... until unique. Filename sanitization strips path
components and unsafe characters.

System prompt instructs Claude to call save_document when the user wants a
file (not chat text) and not to duplicate the file contents in the chat
response — just write the file and tell Aaron where it landed.

ingest.py and watcher.py now skip files under Drafts/ at ingest time so
generated drafts don't pollute future retrieval. Drafts can still be opened,
edited, and shipped; they just don't become part of the searchable corpus
unless Aaron explicitly moves them out of Drafts/.
2026-05-20 00:41:26 +00:00
aaron 84994f9282 api.py: prompt-cache system prompt and memory across tool_use round-trip
Move persistent memory from the user message into system blocks with
cache_control: ephemeral on the last block. The static prefix (system prompt +
memory, ~3-5K tokens typically) is identical between the two LLM calls of a
tool_use round-trip and stable across turns within the 5-minute cache TTL.

Without this, the tool-call retrieval architecture roughly doubled input
token cost on retrieval-needed turns (full context billed twice). With cache
reads at ~10% of standard input, the duplication cost drops by ~90% — the
"twice as expensive" hit becomes "slightly more expensive plus tool overhead."

client_time stays in the user message (per-turn dynamic, should not be in the
cached prefix).
2026-05-19 23:13:43 +00:00
aaron 9e86297e2a api.py: tool-call retrieval, drop the keyword intent classifier
Removes classify_retrieval_intent and the type/folder filter parameters on
retrieve_context. The keyword classifier was the same anti-pattern as the
formatting-driven docx chunker: a heuristic that locks the user into specific
phrasings and fails silently on anything novel. A scope enum (personal /
library / conversations / memory) would have been the same heuristic in a
fancier wrapper — the categories themselves are mine, not Aaron's.

New shape: a retrieve_documents tool exposed to Claude. Tool takes a single
query argument; the model decides when to call it, what to search for, and
how many times per turn (multi-query falls out naturally for compound asks).
Pre-LLM retrieval is gone — memory still rides as ground truth in the prompt,
but corpus content is fetched on demand by the model with concrete queries
it crafts itself, not the user's raw phrasing.

retrieve_context is now pure: hybrid retrieval + cross-encoder rerank + dedup,
no filters. The reranker ranks, the model judges relevance. When ranking
fails (e.g. abstract instructional queries pulling philosophy books), the
right fix is a better reranker, not another query-time taxonomy. That work
is acknowledged but deferred.

System prompt updated to teach the model about the tool and to prefer
concrete tokens (named entities, project names, course codes) over abstract
phrasing when constructing search queries.
2026-05-19 23:05:25 +00:00
aaron 9955c7e383 encoding: per-slide pptx chunking + extract_blocks API; api: recency tiebreak
extract_blocks(filepath) is the new structured-extraction entry point, returning
list[{heading, text, kind}]. chunk_and_embed accepts either str (blind-chunk
back-compat) or list[dict] (one chunk per block, blind-split if oversize, heading
prepended for retrieval context and stored in metadata).

- pptx: one block per slide. Slide title becomes block heading; speaker notes
  fold into the body. Image-only decks with title-only slides now produce
  heading-only chunks instead of being recorded as extraction failures.
- docx: deliberately single-block (back-compat). Heading-style section detection
  was implemented and rolled back: hand-formatted CVs are Normal-styled with
  bold-as-heading, and tying chunk boundaries to formatting choices would lock
  future-user into preserving those choices forever. Lexical + cross-encoder
  retrieval already handles substring matching inside blind-chunked CVs.
- pdf/txt/md: unchanged (single block, blind chunking).

Recency tiebreak in retrieve_context: pull created_at into the SELECT, use it
as secondary sort key in _rerank so memory/journal snapshots prefer the latest
copy among near-duplicate content.

reindex_docx_pptx.py now accepts --ext=pptx,docx... so re-ingest can target a
subset; previous hardcoded delete regex would have wiped both even with a
single-ext target.
2026-05-19 21:58:25 +00:00
aaron 50b97e2998 api.py: folder-aware retrieval, near-duplicate dedup, folder in citations
Three refinements to retrieve_context, all keyed off observed failures from
test_retrieval.py:

- Library/personal split. classify_retrieval_intent now returns
  (type_filter, folder_exclude_prefixes). Biographical document intent excludes
  Library/* so philosophy/cognition books stop crowding out CVs and dossiers
  for queries like "write me a bio".

- Near-duplicate collapse. Multi-folder copies of the same file (e.g., several
  Teaching Philosophy.pdf in different application folders) used to fill the
  top-N with the same content. Dedup by first-300-chars hash after rerank.

- Folder in source citations. Surface metadata.folder alongside basename so
  the LLM can disambiguate among 21 CV.docx variants and the user can see
  which copy a citation refers to.

Also: bump hnsw.ef_search to 500 when a WHERE filter is present.
pgvector 0.6 doesn't iterate past its initial HNSW candidate list, so a
restrictive filter that excludes the nearest neighbors otherwise returns
empty.
2026-05-19 21:35:28 +00:00
aaron 8d560f9f5e api.py: hybrid retrieval with intent routing and cross-encoder rerank
Replaces pure-dense top-8 retrieval with a three-stage pipeline:
- BM25 (tsvector + websearch_to_tsquery) and dense (pgvector) in parallel,
  fused with Reciprocal Rank Fusion
- Optional type filter driven by classify_retrieval_intent() so questions
  about prior conversations don't pull documents and vice versa
- Cross-encoder rerank (ms-marco-MiniLM-L-6-v2) over RRF candidates before
  taking final top-N

Also adds scripts/reindex_docx_pptx.py — one-off re-ingest used to recover
table/header/text-box content in docx and pptx after the 93c0d89 extractor
upgrade — and scripts/test_retrieval.py to exercise the new pipeline against
representative queries.

Schema: requires GIN index on to_tsvector('english', document) (already
created out-of-band via psql since Apache AGE in shared_preload_libraries
blocks ALTER TABLE on this database).
2026-05-19 21:11:15 +00:00
aaron 732e450d21 Stop silent data loss in voice capture pipeline
Empty transcripts and transcription failures previously
deleted the temp audio and returned without writing any
record to disk — violating parity-at-encode (raw content
is episodic context, not noise).

- Preserve audio in Journal/Media/YYYY-MM/ on all paths
  (success, empty, failure) instead of unlinking.
- Write a markdown entry to Journal/Captures/ on failure
  paths with status, audio_path, and error fields.
- Add status: saved to successful captures so frontmatter
  is uniform across success and failure.
- Fire SSE capture_saved events on all terminal paths,
  with status included.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 23:41:51 +00:00
aaron 63c58b5bb3 Extend session lifetime to 365 days
Single-user personal app threat model is theft-of-device, not
stolen-cookie. 30-day idle re-prompts created friction without
proportional security benefit. Server TTL and client max-age
remain in sync via shared constant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 23:29:38 +00:00
aaron 6c2af55e7e Server-side session TTL enforcement
- session_exists() now rejects rows older than 30 days,
  matching the client cookie max-age.
- Opportunistic cleanup of expired rows on session_exists()
  calls, preventing unbounded growth of sessions.db from
  orphaned tokens (PWA reinstalls, manual cookie clears).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 23:28:39 +00:00
aaron 5b4a299414 encoding.py: write_embeddings_batch accepts commit parameter for transactional composition
Adds an optional commit=True parameter to write_embeddings_batch. When True
(default, matching prior behavior), the function commits the connection
after the per-row UPSERT loop. When False, the caller manages the
transaction.

This unblocks fix #1 (pgvector-bypass paths) and fix #2 (watcher
two-transaction pattern), both of which need to compose embeddings writes
with other database writes in the same transaction. Without this lever,
either fix would require duplicating the UPSERT logic outside this helper
or introducing a second commit boundary inside an otherwise atomic
operation.

No behavior change for existing callers — they all use the default
commit=True and continue working unchanged.
2026-05-05 02:52:33 +00:00
aaron b09e35892c encoding.py: strip frontmatter from .md at extraction time
The capture endpoint (api.py:702, 833) writes Journal/Captures/*.md
files with a markdown-bold-style header block (`**type:** voice`,
`**modality:** audio`, `**status:** unprocessed`, optional `**media:**`
and `**project:**`) followed by a `---` separator. extract_text for .md
was a bare filepath.read_text, so every capture-derived chunk in
pgvector embedded the frontmatter as raw text, polluting retrieval.

Fix adds _strip_md_frontmatter, called only for the .md branch:

- Capture-style: optional leading H1 (preserved), then consecutive
  `**key:** value` lines (and blanks), terminated by `---`. The H1 is
  retained; the key/value block + separator are removed.
- YAML-style: file's first non-empty line is `---`, terminated by `---`.
  Only triggered when no heading precedes — guards against the common
  `# Title` + `---` (horizontal rule under heading) pattern seen in
  Journal/aaronai-architecture.md and four other Journal/*.md files.

Body `**bold:**` lines (e.g. `**Visual description:**` in image
captures) and body `---` horizontal rules are never touched: the scan
aborts as soon as a non-frontmatter line appears in the leading block.

briefing_generator_v2.py's split("---", 1) heuristic was reviewed and
not reused — fragile on substring matches and on documents with
multiple `---` rules.

Verified against:
- 2026-04-26-22-44-voice.md: frontmatter stripped, body retained, H1
  retained.
- 2026-04-27-04-34-image.md: frontmatter stripped, `**Visual
  description:**` and `**Voice annotation:**` body bold-headers
  retained, trailing `---` not consumed.
- Journal/aaronai-architecture.md (5 body `---` rules): output
  byte-identical to read_text (96101 chars).
- Synthetic YAML doc: stripped correctly when no leading heading.
- Synthetic plain markdown with body `---` rules: untouched.
- Empty input + heading-only file: untouched.

Existing capture chunks in pgvector retain polluted text; the fix only
affects future extractions. Backfill decision deferred — the cleanest
path is `touch -h Journal/Captures/*.md` to bump mtime and let the
watcher re-ingest naturally on the next cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 02:20:55 +00:00
aaron e38d283e59 watcher.py: exclude 3 image-only pptx files from ingestion
Three files in the original ingest_failures cohort have been
characterized via direct OCR and confirmed to lack ingestible text:

- Presentations/Renders.pptx — 35 PICTURE-shape renders, 33/35 zero-char
  on OCR, 2 with noise (20 and 29 chars).
- Presentations/Ribbon Cutting Slideshow.pptx — 10-slide event photo
  deck, 9/10 zero-char, 1 with 17 chars of noise.
- Academic/DDF555 3D Computational/GH Slicer Notes [Autosaved].pptx —
  Office autosave duplicate of GH Slicer Notes.pptx; first 9 images
  byte-identical (sha256) to the canonical file. 2 net-new images
  contribute 36 noisy chars. Excluding to prevent double-embedding the
  same content under two source filenames.

Pattern matches f18fb64 (path.parts membership). Folder-level globs
were considered and rejected: /Presentations/ contains successfully
embedded text-bearing decks (aaronnelson_3D 4D.pptx,
aaronnelson_slideslam.pptx). Exact-name + parent-folder membership
applied in both watcher filter sites (get_changed_files and
IngestHandler._should_ignore).

The fourth file in the cohort, GH Slicer Notes.pptx (the canonical
non-autosave version), was confirmed to carry 379 chars of real text
(Grasshopper UI / code samples) across 6/9 images. It remains in
ingest_failures unresolved, awaiting the eventual ocrmypdf backlog
pass.

Cleanup: 3 ingest_failures rows resolved (the excluded files).
Unresolved count: 94 → 91.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 01:42:40 +00:00
aaron 8e61e4dedb docs: OCR install record for 2026-05-04
Tesseract OCR installed on the VPS (apt: tesseract-ocr, tesseract-ocr-eng).
Python wrappers added to venv (pip: pytesseract, ocrmypdf).

This commit is the install record only. No code change — async OCR
worker, capture path integration, and backlog processing are separate
followups.

Smoke test results captured in the file:
- pytesseract on a textual GH Slicer Notes.pptx slide image: 126 chars
  in 0.22s (Renders.pptx, also in the 4-image-only-pptx cohort, was
  tried first but contains only rendered designs with no text — noted
  as a likely candidate for exclusion rather than OCR).
- ocrmypdf on a 4-page Lexmark CX510de scan from the Tenure/Dossier
  Scan 2022 set: 2270 non-whitespace chars in 3.72s (~0.93s/page).
  Real readable English; usable as the reference timing for the
  eventual async worker queue.

Deferred decision: project has no dependency manifest (no
requirements.txt, pyproject.toml, etc). Tracking that as its own
followup rather than bolting it onto this install. The capture-path
integration commit will be the natural point to address it if it
hasn't been resolved by then.
2026-05-04 16:58:30 +00:00
aaron 7b77794319 api.py: enable PRAGMA foreign_keys=ON in _connect helper; clean up 2 message orphans
The messages table declares FOREIGN KEY (conversation_id) REFERENCES
conversations(id), but PRAGMA foreign_keys was never enabled — SQLite
defaults it to OFF per connection, and _connect() did not set it. Two
orphan rows existed in messages (conversation_id='test123' pointing at
a never-existing conversation; both rows from one ~11-second test event
on 2026-04-26).

Audit before changing the PRAGMA:
- All FOREIGN KEY declarations across both DBs (conversations.db,
  sessions.db) accounted for via PRAGMA foreign_key_list on each
  table. Only one FK exists: messages.conversation_id ->
  conversations.id, ON DELETE NO ACTION.
- All tables enumerated via sqlite_master. Two tables in
  conversations.db (conversations, messages); one in sessions.db
  (sessions). No surprises.
- PRAGMA foreign_key_check confirmed exactly the 2 known orphans and
  zero violations elsewhere.

Both delete paths in api.py (delete_conversation at :471, and
clear_all_conversations at :986) already delete from messages BEFORE
conversations, so cascade behavior was correct in code. The orphan
state was caused by a direct INSERT against a non-existent
conversation_id at chat-test time, which an unenforced FK silently
accepted. Turning the PRAGMA on prevents this class of bug at insert
time, not delete time — no delete-path code changes were needed.

Order of operations followed the constraint that orphan cleanup must
precede PRAGMA-on (SQLite would not retroactively delete orphans, but
foreign_key_check would surface them confusingly on any future
operation that touched the messages table):
1. DELETE FROM messages WHERE conversation_id NOT IN (SELECT id FROM
   conversations) — removed the 2 known orphans.
2. Added PRAGMA foreign_keys=ON to _connect() so every connection
   from _connect_conversations() and _connect_sessions() gets FK
   enforcement (SQLite requires per-connection setting).
3. Restarted aaronai.service.

Verification:
- Smoke: GET /api/conversations and /api/conversations/{id}/messages
  both return 200 with expected payloads against the live api.
- E2E single-delete: synthetic conversation + 2 messages inserted via
  the api's _connect helper (FK on); DELETE /api/conversations/{id}
  via the live endpoint removed both rows from both tables.
- Clear-all e2e: skipped on live DB (destructive) — code shape is
  structurally identical to single-delete, no FK-relevant logic
  difference.
- Load-bearing negative test: INSERT into messages with a
  non-existent conversation_id via _connect_conversations() raised
  sqlite3.IntegrityError("FOREIGN KEY constraint failed"). This is
  what proves the PRAGMA actually took effect, not just that we set
  it.

Final counts: 7 conversations, 290 messages (down from 292 by the 2
orphans cleaned up).

Note: an explicit BEGIN/COMMIT around the two-execute delete paths
was considered and skipped. SQLite's implicit-transactional default
already gives the atomicity needed; explicit transactions would be
clarity-only and belong in a separate commit.
2026-05-04 16:41:55 +00:00
aaron d985f9e91e dream.py: raise_for_status on manifest writes; total_chunks as actual corpus count
Two correctness bugs in dream_pipeline manifest assembly.

write_manifest at lines 487-491 swallowed HTTP 4xx/5xx responses
silently. requests.put() only raises on transport-level errors (DNS,
connection refused, timeout); 401/403/500/507 come back as Response
objects and never trigger the except. The code printed "Manifest
written" while the manifest never persisted. The same file's deliver()
function at line 434 already used response.raise_for_status() — the
pattern was already established, write_manifest just skipped it.

Fix: bind the response and call raise_for_status() before the success
print. The except message changes from "(non-critical)" to "manifest
not persisted" because HTTP failure now means manifest data was lost,
which is critical, not quiet.

corpus_data["total_chunks"] at lines 621-622 stored
delta["new_chunks"], duplicating the sibling field
new_chunks_since_last_dream. The field name claimed absolute corpus
size; the value was a delta of recently-touched files. Verified in
live manifests: total_chunks: 0 while pgvector held 11,379+ document
embeddings.

Fix: query SELECT COUNT(*) FROM embeddings inside dream_pipeline,
store as total_chunks. Tightly-scoped one-shot connect via the
existing get_pg() helper. Telemetry query failure is treated as
non-critical and falls back to 0 — pgvector hiccup should not crash
an otherwise successful dream pipeline.

Bonus finding (not fixed in this commit): new_chunks_since_last_dream
is itself misnamed. observe_corpus() reads the watcher's mtime cache
and counts files (not chunks) whose mtime is newer than last_dream.
Both fields were "files touched since last dream" duplicated under
two different names; this commit fixes only the total_chunks
semantics. Renaming new_chunks_since_last_dream is out of scope —
manifests are write-only telemetry today, no consumer reads either
field, and the rename is a separate decision.

Verification: real pipeline run produced manifest with total_chunks
matching SELECT COUNT(*) directly; doubled as a smoke test for the
embedder cache (single Loading weights line), type_distribution
propagation, and the manifest write success path.
2026-05-04 16:29:04 +00:00
aaron b9eea6cb62 watcher.py: extend lockfile filter to catch UTF-8-mangled ~$ prefixes
Three rows in ingest_failures were Office lockfile leftovers whose
filename starts with ~� (~ followed by the UTF-8 replacement
character) instead of ~$. Somewhere in the Nextcloud sync chain the $
byte was lost or replaced; the file now lives on disk as a real file
with this corrupted name. The watcher's ("~$", ".") prefix filter
didn't match, so each cycle tried to ingest these as pptx, hit
BadZipFile inside python-pptx (lockfiles aren't real Office documents),
and they ended up permanently in ingest_failures.

Three filter sites in watcher.py applied the lockfile prefix check:
  - ingest_file() at :127
  - get_changed_files() at :200
  - IngestHandler._should_ignore() at :290

All three now match ("~$", "~", ".") — broadened to catch any tilde
prefix, not just ~$. The cross-check against pgvector embeddings and
disk found zero legitimate tilde-prefixed files in the corpus, so the
broader filter has no false-positive risk in this corpus.

Cleanup: 3 ingest_failures rows resolved (filepath LIKE '%/~%').
Unresolved count drops 97 → 94.

If a fourth filter site is ever added, the right shape is consolidating
the lockfile prefix check to a shared function or constant. Three
parallel sites with three different tuple orderings is acceptable for
now but worth normalizing if the surface grows.
2026-05-04 16:19:56 +00:00
aaron 93c0d89308 encoding.py: extend docx and pptx extractors to walk tables, headers/footers, text-boxes, group shapes, and notes
The previous extractors walked only top-level body paragraphs (docx) and
top-level shape.text (pptx). Diagnostic on the 17 non-PDF "no_text"
ingest failures revealed that 13 docx files in the failure cohort have
100% of their content in tables (paras_with_text=0, table_cells=6-108).
These are syllabi, rosters, rubrics, and homework worksheets structured
as a single document-wide table — high-value academic content the corpus
was silently missing.

docx walker now covers:
- body paragraphs (existing)
- tables, including nested tables in cells (recursive helper)
- header and footer paragraphs per section
- text-box content via XPath against w:txbxContent (no first-class API
  in python-docx; future-proofing — none of the current failure cohort
  has text-boxes)

pptx walker now covers:
- top-level shape text (existing)
- recursive descent into group shapes
- table cell text via shape.has_table / shape.table.iter_cells()
- speaker notes via slide.notes_slide.notes_text_frame.text

Out of scope: SmartArt diagrams, chart titles/labels, OLE objects,
content controls. None of the current failure cohort has these.

Recovery: 13 of 17 failures now ingest successfully. The 4 remaining are
image-only pptx files (Renders.pptx, Ribbon Cutting Slideshow.pptx, two
GH Slicer Notes variants — all PICTURE-shape decks with no text in any
walkable structure). They stay in ingest_failures unresolved, awaiting
OCR or path exclusion.

Side effect worth noting: the regression check on 4 known-good files
that were already producing embeddings showed all four gained content
under the new walker — a Mod03 pptx grew from 23,993 to 57,462 chars
(+33,469), Braskem Report docx grew 33,050 to 38,977 (+5,927), DDF MA
program docx grew 37,210 to 47,603 (+10,393), SUNY PIF GRANT pptx grew
22,259 to 23,546 (+1,287). These files have been in the corpus all
along with table or notes content silently dropped. They will surface
the additional content on next re-ingest, improving retrieval quality
for any future query that touches them.

Cleanup: ingest_file already calls resolve_ingest_failure on successful
ingest, so the 13 recovered files were marked resolved=TRUE during the
retry pass. No separate cleanup SQL was needed.
2026-05-04 16:12:56 +00:00
aaron f18fb64fe5 watcher.py: exclude generative-graphic folders and zero-byte files
Two-sample diagnostic of the 128 ingest_failures rows surfaced two
folders whose contents are exclusively non-text PDFs (iText-produced
generative graphics from Processing sketches and computational design
sketches) and three zero-byte test artifacts. None of these have ever
produced an embedding chunk, and they have nothing extractable to
contribute. Excluding them removes 19 / 128 (15%) of the locked-out
failures from the cohort and prevents future versions of the same
patterns from re-failing.

Folder exclusions use path.parts membership rather than substring
matching — eliminates false-match risk if similarly-named folders
appear elsewhere in the corpus (e.g. an unrelated "Generative Design"
or "Computational Design 2017" directory created later). The existing
"Admin/Backups" / "Journal/Media" substring checks are looser, but
new exclusions take the tighter pattern.

Zero-byte filter goes in get_changed_files() only — the actual
ingestion gate. Adding stat() to _should_ignore() (the FS-event noise
filter) would introduce a race where the file is gone between event
fire and stat call. Empty files briefly trigger pending=True but
produce no work after debounce; cosmetic only.

Cleanup applied separately via UPDATE: 19 ingest_failures rows for
these paths marked resolved=TRUE. Unresolved-failure count: 129 -> 110.

Verified: get_changed_files() with empty state returns 1418 changed
files; all 5 excluded probes (2 folder-matched + 3 zero-byte) absent
from the result, control file present. Watcher service restarted
clean; startup scan reports no missed files.
2026-05-04 06:24:08 +00:00
aaron 72e07afc03 watcher.py: do not mark failed ingests as successfully ingested
ingest_files() updated state[path] = mtime unconditionally after every
ingest_file() call. ingest_file() returns 0 when text extraction fails,
embedding fails, no chunks are produced, or the pgvector write fails —
in every one of those cases, the path was still recorded as ingested
at the current mtime. On the next pass, get_changed_files() saw the
mtime match and skipped the file, locking it out of the corpus until
something modified it on disk.

record_ingest_failure() writes to a UI-visible failures table, but
nothing reads that table to retry. So failures accumulated silently:
the file was simultaneously logged as failed AND tracked in
watcher_state as up-to-date, and the second condition won.

Fix: only update watcher_state when ingest_file returns count > 0.
Failed ingests will be retried on the next watcher cycle until they
succeed or are explicitly excluded.

Diagnostic at fix time: 129 rows in ingest_failures, 128 currently
locked out of the corpus (filepath in watcher_state with mtime matching
current disk). 128/129 are text_extraction failures, mostly scanned
PDFs (106 .pdf, 13 .docx, 7 .pptx, 2 .md, 1 .txt). 1 source no longer
exists on disk. 0 have had their disk mtime change since failing — i.e.
without this fix, none of them would ever retry. Cross-check shows
watcher_state has 1466 paths vs. 1061 distinct sources in pgvector
embeddings, leaving a residual silent-gap of ~276 files after
accounting for failures.

Historical cleanup of files already locked out by this bug is tracked
separately. New failures from this commit forward will retry naturally.
2026-05-04 03:52:01 +00:00
aaron c3011c80a5 api.py: route all sqlite3.connect() through helpers; enable synchronous=NORMAL per-conn
Followup to 4204806 (WAL + index + backup.sh). The previous commit
deferred synchronous=NORMAL because it's a per-connection PRAGMA and
api.py has 16 sqlite3.connect() call sites — setting it once at init
would have applied to nothing afterwards.

Adds three helpers near the *_DB constants:
- _connect(path): inner; sets PRAGMA synchronous=NORMAL and uses
  timeout=5.0 (5000ms busy_timeout) on every new connection.
- _connect_conversations(), _connect_sessions(): named wrappers so call
  sites read explicitly.

Mechanical replacement at all 16 call sites: 4 sessions, 12 conversations.
No semantic change beyond the PRAGMA + busy_timeout — every site still
opens-then-closes, no held-open connections.

busy_timeout=5000ms is cheap insurance: under WAL with api.py as sole
writer, contention should be near-zero, but the backup.sh online-backup
path briefly holds a read lock on the source, and any future second
writer would otherwise hit SQLITE_BUSY immediately on contention.

Combined effect with WAL: per-write fsync count drops from ~2 to ~1
(WAL alone) further reduced by synchronous=NORMAL deferring fsyncs to
checkpoint boundaries. No durability loss for the use case (single
host, app crash tolerated, OS crash gives at most one lost transaction).

Not included: foreign_keys=ON. Audit found 2 orphan rows in messages
(conversation_id pointing to deleted conversations) and untested write
paths that could begin raising IntegrityError. Tracked as separate
followup: inspect orphans, identify the delete path that didn't
cascade, clean up, then enable enforcement and test chat delete flow
end-to-end.
2026-05-04 03:39:13 +00:00
aaron 4204806c80 conversations.db, sessions.db: enable WAL, add message index; update backup.sh
Both databases ran with journal_mode=delete — every write rewrote the
rollback journal per transaction. WAL eliminates the journal-rewrite and
lets readers run without blocking writers.

Index on messages(conversation_id, timestamp DESC) is preventive — only
280 rows today, but the access pattern (load conversation history in
order) is exactly what a composite index serves, and we don't want to
re-revisit this when the table grows.

backup.sh updated in the same commit because WAL changes the on-disk
layout: a bare `cp` of just the .db file can miss recently-committed
transactions that still live in the -wal sidecar, and can race with
concurrent writes to produce a torn file. Switched to the SQLite Online
Backup API via python3 -c "...src.backup(dst)..." — same mechanism as
the sqlite3 CLI's `.backup` (which isn't installed on this host),
handles WAL correctly without forcing a checkpoint, and is non-locking
from the writer's perspective. Verified backup integrity_check returns
ok and row counts match.

Note: synchronous=NORMAL was considered but deferred — it's a
per-connection PRAGMA, and applying it correctly requires a connect
helper that wraps every sqlite3.connect() call site in api.py (~14
sites). Out of scope for this commit; tracked as a follow-up. WAL alone
delivers the journal-rewrite elimination and reader/writer concurrency
improvements; the additional fsync reduction from synchronous=NORMAL is
a smaller marginal win on top.

Confirmed via concurrency audit that api.py is the sole writer to both
databases. ingest_conversations.py and dream.py are read-only consumers
of conversations.db; nothing else touches sessions.db.
2026-05-04 03:24:51 +00:00
aaron c5fc517fef ingest_conversations.py: lazy-load embedder to match ingest.py pattern
Embedder was instantiated at module import (~30-60s, ~200MB) regardless
of whether new conversations existed. On nights with no new content
(most nights per the logs), the script paid the load cost and exited
immediately. ingest.py:134 already uses lazy loading; this brings the
two ingest scripts into a consistent shape.
2026-05-04 03:13:45 +00:00
aaron b35d44ef58 dream.py: cache the SentenceTransformer embedder across retrieve() calls
Pipeline mode calls retrieve() three times (NREM, Early REM, Late REM).
Previously each call re-imported and re-instantiated SentenceTransformer
("all-MiniLM-L6-v2"), allocating ~200MB and spending 30-60s on disk->CPU
init three times sequentially. lru_cache(maxsize=1) makes the load happen
once per process.

Expected: pipeline runtime drops ~100-180s, removes 2x redundant 200MB
allocations, and reduces transient memory pressure during the same window
when other nightly jobs may run.
2026-05-04 03:11:22 +00:00
aaron a27f22ceaf api.py: switch whisper to distil-large-v3, beam_size=1, cpu_threads=4
Three changes to reduce voice-note transcription latency on the VPS:
- Model: large-v3 -> distil-large-v3 (~6x faster, near-identical English
  accuracy; language is already hardcoded "en").
- beam_size: 5 (default) -> 1 (~3-4x faster on clean audio).
- cpu_threads: 8 -> 4 (the box has 8 cores running api, dreamer, watcher,
  nextcloud concurrently; ctranslate2's inter-op pool plus context switching
  makes 4 effectively faster than 8 here).

Combined effect expected ~10-15x over prior config. No accuracy regression
expected for the voice-note use case (English, clean audio, domain terms
already supplied via initial_prompt).
2026-05-04 01:00:32 +00:00
aaron 7c7b649775 embeddings: enforce type/created_at on writers; manifests carry type_distribution (Improvement #2 part B+C)
Writers now enforce type and created_at:
  - encoding.py: ValueError raised at write_embeddings_batch if row dict lacks
    'type'. created_at remains SQL-supplied (NOW() server-side). ON CONFLICT
    DO UPDATE now also rewrites type=EXCLUDED.type and preserves the original
    created_at via COALESCE(embeddings.created_at, EXCLUDED.created_at) — a
    re-ingest re-classifies type but does not overwrite a backfilled mtime.
  - ingest_conversations.py: same assertion. ON CONFLICT intentionally keeps
    EXCLUDED.created_at semantics (Aaron-AI conversation created_at tracks
    convo.updated_at; re-runs should refresh).
  - Column-level NOT NULL is not added; application-layer raise gives a
    faster, more debuggable failure than a Postgres constraint error.

Retrieval propagates type into chunks:
  - retrieve() SELECT now includes type; chunk dicts carry "type": etype.
  - WHERE clause built dynamically from excluded_sources and the new
    --type-filter CLI arg (experimental, default None, pgvector retrieval
    only — Graphiti chunks have no embeddings.type to filter on).
  - retrieve_graphiti unchanged; its chunks lack the type field.

Manifests carry type_distribution per stage:
  - dream_pipeline writes stage_data[<stage>]["type_distribution"] for nrem,
    early_rem, late_rem — a Counter over chunk types, filtering None so
    Graphiti chunks (when DREAMER_SUBSTRATE=graphiti) don't pollute the
    distribution. Pgvector chunks always carry type post-backfill; if None
    appears, the backfill or writer enforcement has regressed.

Verification:
  B1 force re-ingest of "Finite and infinite games -- James Carse.pdf":
       all 84 chunks preserved created_at=2026-04-27T06:11:55Z
  B2 missing-type assertion raises ValueError, no row leaked to embeddings
  B3 ast.parse(*) clean; EXPLAIN renders for {no excl/no filter,
       type_filter only, excl 2 elems, excl 1 elem edge case, both};
       all five plans use HNSW index scan with correct Filter clauses
  C1 retrieve("nrem") returns 8 chunks each carrying "type" key
  C2 type_distribution = {'document': 5, 'chatgpt_conversation': 3} —
       2 distinct types, 62.5/37.5 split (looser bar: >=2 types,
       no single type >=90%)

The type and created_at fields are now load-bearing: every dream manifest
emits type_distribution per stage. Reverting the backfill makes the
distribution show NULLs at every dream run.
2026-05-04 00:15:43 +00:00
aaron 3c7c228db0 embeddings: backfill type and created_at (Improvement #2 part A)
Backfills 9,815 type-NULL rows to 'document' (extension classifier, 100% hit)
and 12,109 created_at-NULL rows via five batches:

  C1 filepath_stat:        9,649  filesystem mtime via metadata.filepath
  C2 watcher_state_unique:   676  unique source-name lookup in watcher_state
  C3 watcher_state_collision_pick_latest_of_N:
                             234  collision; most-recent watcher mtime
  C4 chatgpt_export:       1,548  convo create_time from export JSONs
                                  (168/168 distinct convo_ids resolved)
  C5 sentinel:                 2  2026-04-26T00:00:00Z (pgvector migration date)

Provenance written to metadata.type_source and metadata.created_at_source
on every row changed by this run. type_source is empty on rows where the
type field was already populated pre-run; in those cases the snapshot
table is the source of truth for what changed.

Snapshot: embeddings_backup_2026_05_03 (CREATE TABLE AS SELECT id, type,
created_at, metadata FROM embeddings; 14,069 rows; revertable via id-join).

Verification:
  V1 live counts:      type_null=0  ca_null=0
  V2 spot-check 11 rows across cohorts: provenance correct
  V3 snapshot intact: 14,069 rows, pre-backfill NULL counts preserved
  V4 cross-check vs snapshot: reconciles per-provenance to dry-run

Read-side use (B + C: writer enforcement + minimal retrieval read) deferred
to a separate session. The backfill is complete and verified, but the type
and created_at fields are not yet load-bearing — every current reader still
ignores them. Without B+C this lands as data prep, not behavior change.
2026-05-03 23:58:53 +00:00
aaron 2df1a2fe01 docs/inventory: layer 2026-05-03 updates (resolutions, corrections, new findings)
Inventory dated 2026-05-02 is preserved as a point-in-time snapshot. Today's
updates are layered on top in a dated addendum section after "Findings
summary" and before "Phase 1 — Scripts" so the original snapshot reads as
written and readers can see what changed and when.

Resolved:
- NREM-shape divergence #1 (`dream.py` cumulative cross-night exclusion
  500-cap) — replaced with session-scoped novelty.

Corrections to existing findings:
- `stage2_metadata` lives on `stage_3_queue`, not `stage_2_queue` (the
  2026-05-02 entry implied otherwise). Verified by direct schema read.
- Stage 2 char_length gate runs *before* the Mistral call. For sub-2000-char
  docs, Mistral is never invoked — frames are not extracted then discarded,
  they are simply not extracted. Reframes the architecture's "Stage 2
  produces orientation for everything" commitment.

New findings (from the 2026-05-03 frame analysis):
- `ingest_conversations.py` bypasses Stage 2 entirely. 198 conversation
  sources have zero frame coverage by design. Combined with the char-gate
  exclusion and Stage 2 failures, only 56% of corpus has any frame data.
- All 14 voice notes and all 39 dream outputs are in the 339-doc gap.
  Primary capture and self-reflection channels are silent to the frame
  system; dreamer cannot frame-condition on its own output.
- File-type \u00d7 frame stratification provides discriminating signal that
  cross-links Improvement #3 to the existing `embeddings.type` NULL-rate
  finding.

Same NREM shape as the original cumulative-exclusion bug — the architecture's
stated commitment and what the code actually does diverge silently. This is
exactly what the inventory exists to surface.
2026-05-03 20:32:55 +00:00
aaron ed2d090afc experiments/frame_distribution_report: Stage 2 frame analysis (Track 1 Improvement #3)
Read-only inspection of the frame data Mistral produces in Stage 2, in
service of Track 2 substrate design (Step 2.4 operation set spec).

Artifacts:
- New SQL view `stage2_frames_v` over `stage_3_queue.stage2_metadata`
  (CREATE OR REPLACE; idempotent; raw JSONB exposed alongside structured
  fields so worker-version drift is inspectable).
- Analysis script: frequency, label-hygiene collisions, per-doc count,
  co-occurrence (top-K), file-type \u00d7 frame cross-tab, worker-version split,
  data-gap accounting, corpus-wide coverage.
- JSON sidecar for diff-across-runs reproducibility.
- Markdown report with explicit Track 2 viability section.

Headline findings:
- Frames cluster meaningfully on the framed-doc subset (subject to
  validation on larger samples for the file-type cross-tab).
- Only 56% of corpus has frame coverage. 198 conversation sources bypass
  Stage 2 by design (`ingest_conversations.py` writes directly to
  embeddings); 339 short docs (<2000 chars) skip Mistral by char-gate;
  12 Stage 2 failures.
- All 14 voice notes and all 39 dream outputs are in the data gap.
  Primary capture and self-reflection channels are silent to the frame
  system. Dreamer cannot frame-condition on its own output.
- 54 normalized label collisions (`Professional Experience` vs
  `Professional_Experience`, etc.) — any router must normalize first.
- "Education" is a near-universal frame (36% of frame-extracted docs);
  cheap 20-doc hand-inspection diagnostic in report \u00a78 to distinguish
  prompt artifact from corpus shape.
- File-type \u00d7 frame stratification is concrete signal that ties to
  Improvement #2 (`embeddings.type` backfill); currently NULL for 71% of
  rows.

No production code touched. View is droppable; script is read-only.
2026-05-03 20:32:37 +00:00
aaron e5898f3019 dream.py: replace cumulative cross-night exclusion with session-scoped novelty (Track 1 Finding 1)
The cumulative `retrieved_sources` list (capped at 500, trimmed to 400 on
overflow) was hiding ~40% of the corpus from Early REM and Late REM after the
cap filled. The architecture and reframe both specify session-scoped novelty,
not corpus-lifetime exclusion. Same NREM-shape divergence as the 2026-05-02
NREM exclusion fix.

Changes:
- Drop `previously_retrieved` load; pop the legacy `retrieved_sources` key
  from `dreamer_state.json` at pipeline start.
- Early REM excludes only the current session's NREM high-scorers.
- Late REM excludes only the current session's NREM \u222a Early REM.
- Remove the across-night accumulation block at the end of the pipeline; reuse
  the in-scope state object for the post-pipeline metadata write (eliminates a
  redundant disk re-read that was reintroducing the legacy key).

NREM exclusion fix from 2026-05-02 preserved (`nrem_chunks = retrieve("nrem",
excluded_sources=None)`).

Verification: post-fix dream-manifest source count rose to 24 (NREM 8 + Early
REM 8 + Late REM 8) vs. 13 / 16 on the two prior comparable runs. Legacy key
absent from `dreamer_state.json` post-run.
2026-05-03 20:32:15 +00:00
aaron 1101bef226 scripts/encoding.py: Stage 1 dual-implementation consolidation (Track 1 Finding 11)
Consolidates four extract paths and two extract-chunk-embed-write pipelines
into a single shared encoding module. Fixes the embedder lifecycle
divergence between watcher and /api/reindex (no more 200MB reload per
reindex click) and unifies failure tracking so /api/reindex failures now
surface in SettingsPanel "Ingest Health".

New files:
- scripts/encoding.py — extract_text, chunk_text, chunk_and_embed,
  write_embeddings_batch
- scripts/failures.py — record_ingest_failure, resolve_ingest_failure
  (shared by watcher.py and ingest.py)

Refactored:
- scripts/watcher.py — drops local extract/chunk/embed implementations
  and CHUNK_SIZE/CHUNK_OVERLAP/SUPPORTED constants; imports from encoding
  and failures. Now writes ingest_failures row on empty-text-extract
  (was silent return 0).
- scripts/ingest.py — substantial rewrite. Exposes ingest_directory(folder,
  embedder=None) for in-process invocation; CLI back-compat preserved via
  ingest_folder wrapper. Module-level SentenceTransformer load removed.
- scripts/corpus_integrity.py — imports extract_text from encoding;
  extract_text_for_retry function removed.
- scripts/api.py — /api/reindex rewritten with BackgroundTasks (uses
  module-level embedder; no subprocess); new /api/reindex/status endpoint
  reading ~/aaronai/reindex_status.json; /api/corpus/retry imports
  extract_text from encoding; INGEST_SCRIPT constant removed (dead after
  this refactor); 409 reentrance guard prevents double-click stomping.

Behavior changes:
- /api/reindex no longer subprocess.Popens; runs in FastAPI BackgroundTasks
  threadpool, doesn't block API thread.
- /api/reindex no longer reloads SentenceTransformer on each click.
- /api/reindex failures newly write to ingest_failures (visible in
  SettingsPanel "Ingest Health" — badge will jump on first reindex).
- New embeddings rows always have created_at = NOW() (canonical, server-side).
- New embeddings rows always include metadata.folder field (None when not
  derivable).
- /api/reindex returns 409 on second click while a job is running.
- New /api/reindex/status endpoint for polling.

Existing 9,815 NULL created_at rows remain unchanged; backfill is a
separate decision if desired.

199 insertions, 256 deletions across 6 files (codebase shrinks net).

Found by Track 1 inventory 2026-05-02 (Finding 11 / cross-cutting F11).
Pre-commit verification: BackgroundTasks already imported, sys.path
resolves correctly via script-path semantics, static import clean.
2026-05-03 01:40:47 +00:00
aaron a317df66f8 dream: factor prompts into module-level templates, repair prompt_hash (Track 1 Finding 11)
prompt_hash() in dream.py was hashing function __doc__ strings, but the
synth functions don't have docstrings, so the hash was always MD5("") =
d41d8cd9 for every dream. The manifest field meant to detect undeclared
prompt drift carried no useful information.

Refactor:
- Each synth function's prompt template moved to a module-level constant
  (NREM_PROMPT_TEMPLATE, EARLY_REM_PROMPT_TEMPLATE, LATE_REM_PROMPT_TEMPLATE,
  SYNTHESIS_PROMPT_TEMPLATE, LUCID_PROMPT_TEMPLATE) using str.format()
  placeholders instead of f-string interpolation.
- Synth functions call TEMPLATE.format(...) at use time. Output is byte-
  identical to the previous f-string implementation.
- prompt_hash() now hashes the four pipeline template constants (lucid is
  on-demand, not part of the nightly manifest — preserves prior scope).
- LUCID_DEFAULT_TASK extracted as a named constant from the lucid fallback
  question (factoring only, no behavior change).
- PROMPT_VERSION_* constants and synth function signatures untouched.
- v1.1 register-shift comment in synthesize_early_rem preserved inline.

The post-fix hash will differ from d41d8cd9 (verified: b65695a1 in static
test). Historical manifests still carry d41d8cd9; the discontinuity is
intentional — pre-fix hashes were equally meaningless and faking continuity
would be worse than acknowledging the break.

Found by Track 1 inventory 2026-05-02 (Finding 11 / divergence #11).
Verified static import + hash determinism before commit.
2026-05-03 00:24:21 +00:00
aaron ec67e19b4f docs/: track Track 1 inventory and reorg plan
These are working artifacts of the 2026-05-02 Track 1 stabilization
work. Versioning them alongside the code keeps the operational
narrative coherent and gives future sessions clear reference docs.

The inventory document includes the cross-repo verification finding
on share_time — captured at the document level so future sessions
don't repeat the same dead-code mischaracterization.
2026-05-03 00:00:16 +00:00
aaron 4b520b2bc2 api.py: minor cleanups (Track 1 inventory findings)
- Fix /auth/check endpoint that referenced undefined SESSIONS
  (Phase 1 finding — would NameError 500 on every call). Now uses
  session_exists(token), the live session-validation mechanism
  defined elsewhere in api.py.
- Remove unused DB_PATH ChromaDB-era constant (paired with the
  ChromaDB directory deletion and aaronai-maintenance.service
  removal earlier this session).

Found by Track 1 inventory 2026-05-02. Cross-repo verification of
share_time (third candidate from the original cleanup proposal)
revealed it is working stores-and-returns persistence rather than
dead code; share_time intentionally not modified.

Inventory document edits are committed separately under the docs/
tracking decision.
2026-05-02 23:59:20 +00:00
aaron 7bebd8ae50 api.py: wire up dream_mode setting (Track 1 Finding 9)
The dream_mode setting was defined in DEFAULT_SETTINGS and watched
by update_settings for reschedule, but run_dream_job never read it —
silently-ignored configuration.

Two changes:
1. DEFAULT_SETTINGS["dream_mode"] flipped from "nrem" to "pipeline".
   The default was a latent regression vector: wiring up the setting
   without changing the default would have silently switched all
   default-config users from full-pipeline (current production
   behavior) to NREM-only nightly runs.
2. run_dream_job reads dream_mode at fire-time, validates against
   {"pipeline", "nrem", "early-rem", "late-rem"}, falls back to
   pipeline with a warning on invalid values. Lucid intentionally
   excluded — it is on-demand only by design and remains available
   via CLI and /api/dreamer/run.

Nightly dream production behavior is unchanged for current users
(no settings.json key → default "pipeline" → no flag passed → same
as before). Users can now meaningfully change the nightly mode by
editing settings.json or via the SettingsPanel.

Found by Track 1 inventory 2026-05-02 (Finding 9 / divergence #9).
2026-05-02 23:38:29 +00:00
aaron 3f7fba7e0e scripts/: separate production from experimental and deprecated
Moves 28 experiment scripts to scripts/experiments/ (E1, E1.4, E1.6, E2,
base_class, cascade, cost_test, briefing, consistency, token series).
Moves 2 dissolved-layer scripts to scripts/deprecated/ (consolidator_v0_1.py,
tier1_migration.py — under the bespoke decision both target retired
substrate work).
Removes 19 .bak* files from disk (gitignored, never tracked; git history
is the durable record of every prior version).

The 11 production scripts remain in scripts/. All systemd ExecStart paths,
api.py subprocess calls, and cron jobs continue to resolve correctly —
verified by grep against /etc/systemd/system/aaronai-*.service, scripts/
references in api.py, and the user crontab.

Track 1 inventory cross-cutting finding: scripts/ mixed 11 production
files with 32 experimental scripts and ~20 .bak files. After this commit
a clean-room reader can identify the live workers from a directory listing
alone.

Found by Track 1 inventory 2026-05-02. See
~/aaronai/docs/scripts-reorg-plan-2026-05-02.md for full reasoning.

After commit, run:
1. git log --oneline -3 — show the new commit on top
2. git status — confirm clean working tree (modulo the docs/ untracked files which are intentional)
2026-05-02 23:28:24 +00:00
aaron 6f2d274d5d api.py: remove 50KB truncation from /api/corpus/retry (completes F14)
The F14 fix on 2026-05-01 removed text[:50000] truncation from
watcher.py, ingest.py, and corpus_integrity.py. The retry endpoint
in api.py was missed — clicking 'Retry' on an ingest-failed file
in the SettingsPanel re-introduced the exact truncation pattern
F14 was meant to eliminate.

Found by Track 1 inventory 2026-05-02 (Finding 2 / divergence #2).
2026-05-02 22:56:33 +00:00
aaron 7615dedf9e dream: NREM does not exclude prior traces
NREM in the reframe is replay-and-consolidation of recent encoded
content. Excluding previously_retrieved sources turns NREM into
novelty-finding, which is Late REM's job. NREM should re-traverse
already-encoded content; that's what consolidation is.

The May 2 abort surfaced this — 52 sources accumulated in the
exclusion list, all of them in NREM's similarity band for the
recurring research/fabrication/teaching query. The dreamer hit
zero retrievable chunks not because the corpus was empty, but
because everything semantically aligned was excluded.

Late REM and Early REM keep the exclusion mechanism — novelty is
their job. Session-scoped exclusion (nrem_high_sources flowing
into Early REM) also preserved.

The 500/400 trim on retrieved_sources is preserved for the
remaining stages that still use it.
2026-05-02 21:33:49 +00:00
aaron 1a8e0353f5 stage3_worker: v2.2 — absolute sudo/systemctl paths, error logging, reset failure counter on recovery failure
Mirrors stage2_worker v2.1 (da98019) resilience fixes:
- Absolute paths for /usr/bin/sudo and /bin/systemctl
- Log stdout/stderr when sidecar restart fails
- Reset consecutive_failures even when wedge recovery fails (prevents
  permanent stuck state if restart itself is broken)
2026-05-01 18:40:25 +00:00
aaron da980193dd stage2_worker: v2.1 — terminal failure states + sudo path fix
Three classes of silent failure converted to clean terminal states:

- Mistral timeout: previously left rows in zombie state (started_at set,
  failed_at null, attempts incremented past retry threshold, row invisible
  to selection query). Now sets failed_at with reason
  'mistral_timeout_after_300s'. Surfaced 2026-05-01 when 17 documents
  accumulated in this state during the Stage 3 saga deadlock incident.

- Mistral parse failure: run_mistral returns {'error': 'parse_failed'} on
  JSON decode failure but process_one wasn't checking, so empty orientation
  ('Active frames: . Frame relationships: ...') was shipped to Stage 3.
  This is F22 from the 2026-04-30 code review. Now sets failed_at with
  reason 'mistral_parse_failure'.

- Wedge recovery hammering: consecutive_failures was only reset on
  successful Ollama restart. With the sudo path bug (also fixed here),
  recovery always failed, so every subsequent failure re-attempted restart.
  Now resets the counter regardless and logs the failure visibly.

Also: subprocess.run now uses absolute paths (/usr/bin/sudo,
/bin/systemctl) instead of relying on PATH, fixing the 'No such file or
directory: sudo' error that broke Stage 2's recover_wedge() since
deployment. F45-adjacent — sudoers entries were added 2026-05-01 but the
PATH issue was masking that fix.

Worker version bumped to 2.1 to match Stage 3's resilience patch level.
2026-05-01 17:28:53 +00:00
aaron b936931668 Stage 3 worker v2.1 — saga-size limit + wedge detection + sudoers fixes
Production incident 2026-05-01: F14 re-cascade attempt surfaced three
compounding issues in cascade resilience.

stage3_worker.py changes:
- MAX_CHUNKS_PER_SAGA=10 — large documents split into multiple bulk
  commits, all sharing the same saga tag for Graphiti document linking.
  Original implementation sent all chunks as one saga; 17-19 chunk sagas
  deadlocked sidecar's Python-side coordination.
- recover_wedge() function — restarts aaronai-graphiti.service when
  consecutive_failures hits threshold. Mirrors Stage 2 pattern.
- run() loop adds consecutive_failures counter with threshold-2
  escalation. Resolves F28 + F29 from code review.
- Worker version bumped 2.0 -> 2.1.
- post_bulk() helper extracts shared HTTP POST + error handling.

Outside-repo changes (system config, separately documented):
- WatchdogSec=600 commented in stage2 + stage3 systemd unit files.
  Workers have no sd_notify support; per-request timeouts in code
  handle the actual failure modes.
- /etc/sudoers.d/aaron-aaronai created with NOPASSWD entries for
  systemctl restart ollama and restart aaronai-graphiti.service.
  Stage 2's existing recover_wedge() was silently broken since
  deployment due to this gap.

.gitignore — added rules for *.bak files, runtime artifacts
(watcher_heartbeat, dreamer_state.json, corpus_integrity_report.json,
watcher_state.json, watcher_status.json), Python cruft, virtual env,
.env, editor/OS files, and Aaron AI runtime data (conversations.db,
sessions.db, memory.md, settings.json).

Untracked 11 files that shouldn't have been committed in 465f2f7
(this morning): backup files and runtime artifacts.

Re-cascading Shop Class (414KB) and BirdAI-Experiments-Log.md (192KB)
through the patched worker after re-extracting full text from disk.
Cascade in progress at commit time.
2026-05-01 05:18:09 +00:00
aaron 465f2f725b Code review fixes: CV pinning, F1 (excluded_sources), F14 (50KB truncation), F37
- api.py: strip CV pinning workaround (parity violation, see architecture doc)
- dream.py: F1 — retrieve_graphiti() now accepts excluded_sources, over-fetches
  3x and filters in-process. Was silently dropping the parameter; would have
  confounded E3 with broken cross-stage exclusion in Graphiti arm.
- watcher.py + ingest.py: F14 — drop full_text[:50000] truncation. Was
  propagating through entire cascade. Postgres TEXT can hold up to 1GB.
- corpus_integrity.py: F37 — same truncation, third path now clean.

Backups: api.py.bak.*, dream.py.bak.*, watcher.py.bak.*, ingest.py.bak.*,
corpus_integrity.py.bak.* timestamped pre-fix.

Re-cascaded Shop Class as Soulcraft (only already-cascaded source affected
by F14, 414KB).
2026-05-01 02:26:37 +00:00
aaron 25e42c0231 corpus_integrity.py: write unreadables with retry_count=0 so OCR can retry when it ships 2026-04-30 22:03:48 +00:00
aaron 7822fb1cc1 corpus_integrity.py: write unreadable files to ingest_failures for UI visibility 2026-04-30 21:59:06 +00:00
aaron 74e2c34f43 corpus integrity: ingest_failures tracking in watcher, reconciliation script, corpus status/retry/reconcile endpoints 2026-04-30 21:54:39 +00:00
aaron 655dea6ae5 add remaining experiment result files 2026-04-30 18:06:52 +00:00
aaron f11cacd9c9 add experiment scripts and results; watcher.py latest changes 2026-04-30 18:06:03 +00:00
aaron 1cf26df450 api.py: return error_type=transcription_failed on Whisper crash, frontend retry logic can now distinguish from network failures 2026-04-30 17:45:47 +00:00
aaron 7cd765146a stage3_worker.py: log sidecar response body on non-200 2026-04-30 17:37:28 +00:00