Compare commits: 7cd765146a..main (62 commits)
| SHA1 | Author | Date | |
|---|---|---|---|
| 5582549321 | |||
| 3ec9a48151 | |||
| 9d09d3fa14 | |||
| f185ed60cb | |||
| a4735053c2 | |||
| f682d8c6a0 | |||
| 151c756b89 | |||
| e96bf40b2f | |||
| 313c0f0341 | |||
| d2ec20e373 | |||
| 10bb29290a | |||
| 9bb083f065 | |||
| 430ea239dd | |||
| 0a1e2b4f61 | |||
| 8c2c597687 | |||
| fda61ad622 | |||
| 84994f9282 | |||
| 9e86297e2a | |||
| 9955c7e383 | |||
| 50b97e2998 | |||
| 8d560f9f5e | |||
| 732e450d21 | |||
| 63c58b5bb3 | |||
| 6c2af55e7e | |||
| 5b4a299414 | |||
| b09e35892c | |||
| e38d283e59 | |||
| 8e61e4dedb | |||
| 7b77794319 | |||
| d985f9e91e | |||
| b9eea6cb62 | |||
| 93c0d89308 | |||
| f18fb64fe5 | |||
| 72e07afc03 | |||
| c3011c80a5 | |||
| 4204806c80 | |||
| c5fc517fef | |||
| b35d44ef58 | |||
| a27f22ceaf | |||
| 7c7b649775 | |||
| 3c7c228db0 | |||
| 2df1a2fe01 | |||
| ed2d090afc | |||
| e5898f3019 | |||
| 1101bef226 | |||
| a317df66f8 | |||
| ec67e19b4f | |||
| 4b520b2bc2 | |||
| 7bebd8ae50 | |||
| 3f7fba7e0e | |||
| 6f2d274d5d | |||
| 7615dedf9e | |||
| 1a8e0353f5 | |||
| da980193dd | |||
| b936931668 | |||
| 465f2f725b | |||
| 25e42c0231 | |||
| 7822fb1cc1 | |||
| 74e2c34f43 | |||
| 655dea6ae5 | |||
| f11cacd9c9 | |||
| 1cf26df450 |
+38
-22
@@ -1,34 +1,50 @@
|
||||
# Environment and secrets
|
||||
.env
|
||||
*.env
|
||||
# Backup files (rely on git history instead)
|
||||
*.bak
|
||||
*.bak.*
|
||||
|
||||
# Databases
|
||||
db/
|
||||
conversations.db
|
||||
sessions.db
|
||||
# Runtime artifacts
|
||||
watcher_heartbeat
|
||||
dreamer_state.json
|
||||
corpus_integrity_report.json
|
||||
watcher_state.json
|
||||
watcher_status.json
|
||||
reindex_status.json
|
||||
|
||||
# Python
|
||||
# Logs (these belong in /var/log/)
|
||||
*.log
|
||||
|
||||
# Python artifacts
|
||||
__pycache__/
|
||||
*.pyc
|
||||
*.pyo
|
||||
*.pyd
|
||||
.pytest_cache/
|
||||
*.egg-info/
|
||||
|
||||
# Virtual environment
|
||||
venv/
|
||||
.venv/
|
||||
|
||||
# Logs
|
||||
*.log
|
||||
# Environment and secrets
|
||||
.env
|
||||
.env.local
|
||||
.env.*.local
|
||||
|
||||
# Memory and settings (personal data)
|
||||
memory.md
|
||||
settings.json
|
||||
|
||||
# Backups
|
||||
Admin/
|
||||
|
||||
# OS
|
||||
# Editor and OS cruft
|
||||
.vscode/
|
||||
.idea/
|
||||
*.swp
|
||||
*.swo
|
||||
.DS_Store
|
||||
Thumbs.db
|
||||
dreamer_state.json
|
||||
migration_progress.json
|
||||
dreamer_state.json
|
||||
migration_progress.json
|
||||
|
||||
# Local data not for repo
|
||||
db/
|
||||
embeddings/
|
||||
experiments/summary_embeddings_cache.json
|
||||
|
||||
# Aaron AI runtime data (personal, do not commit)
|
||||
conversations.db
|
||||
sessions.db
|
||||
memory.md
|
||||
settings.json
|
||||
|
||||
File diff suppressed because it is too large
@@ -0,0 +1,846 @@
|
||||
# BirdAI Component Inventory — 2026-05-02
|
||||
|
||||
*Track 1 stabilization, deliverable 1. Read-only investigation.*
|
||||
|
||||
**Repo state:** HEAD `7615ded` (NREM exclusion fix) on baseline `1a8e035`. Last night's experimental work was reverted.
|
||||
|
||||
**Method:** Each component classified Working / Working-degraded / Broken / In-flight / Experimental / Stopped / Deprecated, with last-touched date from `git log -1`, dependencies, dependents, and a behavior-vs-intent column comparing observed code against `aaronai-architecture.md` and `aaronai-architecture-reframe-2026-05-01.md`.
|
||||
|
||||
**A note on terminology.** "Behavior matches intent" is read against two intent surfaces: (1) the architecture doc as written, which still frames graphiti as the target memory layer, and (2) the reframe doc, which supersedes parts of the architecture doc and which the bespoke decision now extends. Where the two diverge, the reframe is treated as canonical for purposes of this inventory; the architecture-doc-only divergences are flagged separately.
|
||||
|
||||
---
|
||||
|
||||
## Findings summary
|
||||
|
||||
This inventory's most useful work is identifying mechanisms that are running silently, without errors, while doing something the architecture didn't ask for. The 2026-05-02 NREM exclusion bug had that exact shape: NREM was excluding prior traces, the dreamer logged "completed," files appeared on schedule, and the architecture's stated commitment (NREM is replay-and-consolidation) was being violated invisibly. Track 1's job is to find the rest of those before they accumulate.
|
||||
|
||||
### Top-priority NREM-shaped divergences (working, but doing something the architecture didn't request)
|
||||
|
||||
These are the items most worth reading the linked phase entries for. They are ranked by potential impact on Track 1 or on subsequent E6-class work.
|
||||
|
||||
1. **`dream.py` cumulative cross-night exclusion (500-cap).** Phase 1, `dream.py`. Early REM and Late REM exclude up to 500 prior sources accumulated across nights. On a 1,200-source corpus this hides ~40% of the corpus from those modes once the cap fills, and the list is trimmed to 400 only on overflow, which is a churn pattern rather than an architectural choice. The architecture and reframe specify session-scoped novelty; cumulative-across-nights exclusion is nowhere documented. This has the same shape as the NREM bug: a deduplication mechanism running silently that the architecture didn't request and that nobody noticed. **This is the highest-priority finding from the inventory.**
|
||||
|
||||
2. **`api.py /api/corpus/retry` reintroduces 50KB truncation.** Phase 1, `api.py`. The F14 fix removed truncation from `watcher.py`, `ingest.py`, and `corpus_integrity.py` on 2026-05-01. The retry endpoint at line 1074 still writes `text[:50000]`. Clicking "Retry" on an ingest-failed file in the SettingsPanel re-introduces exactly the bug F14 fixed. Working without errors; doing the wrong thing.
|
||||
|
||||
3. **`aaronai-stage3.service` is `enabled` while `inactive`.** Phase 2. The session brief says Stage 3 is stopped manually. The unit is `enabled`, so on next reboot the worker auto-starts and resumes processing the `stage_3_queue` rows that Stage 2 has been adding. The "stopped" state is paper-thin. `systemctl disable` would harden it; nobody has done that yet.
|
||||
|
||||
4. **Stage 2 keeps enqueuing to `stage_3_queue` while Stage 3 is off.** Phase 3. As of inventory time, 6 pending rows sit in `stage_3_queue`, last enqueued 2026-05-02 22:33 UTC. The queue grows until Stage 3 is restarted (and then catches up) or stopped at the producer. Nothing is broken — but the system is doing work whose output sits unconsumed.
|
||||
|
||||
5. **`embeddings.type` NULL for 71% of rows; `embeddings.created_at` text-typed and NULL for 87%.** Phase 3. The architecture treats these fields as load-bearing for "type-aware retrieval" and "temporal awareness." In production, most chunks lack both. Retrieval still works because nothing routes on either field. The doc's commitment and the data shape disagree, invisibly to anyone not querying the schema.
|
||||
|
||||
6. **`graphiti_jobs` documented as "empty" but holds 9 rows from the 2026-05-02 experimental run.** Phase 3. Current-state doc explicitly says "exists, empty (or near-empty)." Reality: 6 failed, 3 committed, all from the rolled-back code. Inert (no current code reads or writes), but the rollback narrative is incomplete on this point.
|
||||
|
||||
7. **`aaronai-maintenance.service` references ChromaDB.** Phase 2. The unit invokes `chops hnsw rebuild --path ~/aaronai/db --collection aaronai`. ChromaDB was retired 2026-04-26. `chops` is not in the venv. The `~/aaronai/db/` directory still exists with a ChromaDB sqlite. Saved from doing damage only because its timer is not enabled. A clean-room reading of `/etc/systemd/system/` would suggest BirdAI is still on ChromaDB.
|
||||
|
||||
8. **`aaronai-dreamer.service` hardcodes `--mode nrem`.** Phase 2. Production scheduling fires `dream.py` with no flag (default = full pipeline). The systemd entry-point is the historical "manual NREM" wrapper. Any future maintainer running `systemctl start aaronai-dreamer.service` from the shell expects "the dreamer" and gets only NREM.
|
||||
|
||||
9. **`dream_mode` setting in api.py defaults is silently ignored by the scheduler.** Phase 4. Setting in `DEFAULT_SETTINGS`, mergeable into `settings.json`, used by `update_settings` to decide whether to reschedule. Not actually read by `run_dream_job`. A configurable scheduling parameter that has no effect.
|
||||
|
||||
10. **Watcher-restart cron line uses sudo not in the sudoers file the session brief documents.** Phase 5. The 2026-05-01 sudoers fix listed `restart ollama` and `restart aaronai-graphiti.service`. The watcher-restart cron line uses `sudo systemctl restart aaronai-watcher`. Either there's an additional sudoers entry the brief doesn't mention, or this watchdog has been silently failing every fire. Worth checking `/var/log/aaronai/watcher-cron.log` (out of scope for this read-only inventory).
|
||||
|
||||
11. **`prompt_hash()` in `dream.py` hashes function `__doc__` strings, but none of the synth functions have docstrings.** Phase 1, `dream.py` notes (folded into the "F8" reference). The hash is therefore identical across all dreams (always the MD5 of `""`). This is the architecture-doc tech-debt item F8 ("`prompt_hash` broken") confirmed in code: the manifest field meant to "catch undeclared drift" carries a constant value. Same shape as NREM: a mechanism that is present and running, but whose behavior contradicts its architecture-stated purpose.
|
||||
|
||||
12. **Two parallel scheduling stacks.** Phase 5. APScheduler in `api.py` and three dormant `aaronai-*.timer` files. The dormant ones aren't firing, so there is no actual harm. Their presence makes "what triggers the dream" harder to answer than it should be.
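
Item 11 above can be illustrated with a minimal sketch. The `prompt_hash` body and the `synth_*` functions below are hypothetical stand-ins, not the code in `dream.py`; the only assumption carried over from the finding is that the hash is built from `__doc__` strings and that a missing docstring is treated as `""`.

```python
import hashlib


def prompt_hash(*funcs):
    """Hash the concatenated docstrings of the given functions (assumed shape)."""
    joined = "".join((f.__doc__ or "") for f in funcs)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def synth_nrem(chunks):  # hypothetical synth function, no docstring
    return chunks


def synth_early_rem(chunks):  # hypothetical synth function, no docstring
    return list(reversed(chunks))


def synth_documented(chunks):
    """Prompt: summarize the chunks."""
    return chunks


# MD5 of the empty byte string: what the hash collapses to without docstrings.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()

print(prompt_hash(synth_nrem, synth_early_rem) == EMPTY_MD5)  # True
print(prompt_hash(synth_documented) == EMPTY_MD5)             # False
```

Under these assumptions, editing the body of a docstring-less function never changes the hash, which is why the manifest field cannot detect drift.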
|
||||
|
||||
### Cross-cutting findings (not necessarily NREM-shaped)
|
||||
|
||||
- **The `scripts/` directory mixes 11 production files with 32 experimental scripts and ~20 `.bak` files.** When reading the directory, it is hard to tell at a glance what is live. Track 1 cleanup candidate: move experimental files to `experiments/` (which already exists with a few) or `deprecated/`, and delete `.bak*` (git history is the durable record). This is mostly cosmetic but makes future inventories easier.
|
||||
|
||||
- **Two implementations of Stage 1 (F11) confirmed.** `watcher.py:ingest_file` and `ingest.py:ingest_file` (and `corpus_integrity.py:extract_text_for_retry` plus the api.py retry path) all reimplement extract-chunk-embed-write. The architecture doc records this as known tech debt; the inventory verifies all four call sites still drift.
|
||||
|
||||
- **The bespoke decision dissolves several components without removing them.** `consolidator_v0_1.py`, `tier1_migration.py`, `graphiti_service.py`, `stage3_worker.py`, both Stage 3 unused-column sets in `stage_3_queue`, `graphiti_jobs` table, the experiment scripts. None is actively harmful in current state; collectively they make the bespoke direction harder to read out of the codebase. Track 1 stripping is the right venue for these.
|
||||
|
||||
- **Memory-and-state fan-out.** The system has at least 7 distinct files outside the database that hold state: `dreamer_state.json`, `watcher_state.json`, `watcher_status.json`, `watcher_heartbeat`, `corpus_integrity_report.json`, `tier1_migration_state.json`, `settings.json`, plus two sqlite DBs (`conversations.db`, `sessions.db`) and a markdown file (`memory.md`). Bespoke design will likely consolidate.
|
||||
|
||||
### What looks fine
|
||||
|
||||
The watcher (`watcher.py` + `aaronai-watcher.service`) is a clean Stage 1 that matches the architecture doc and the parity principle exactly. The capture endpoint works as documented. The `ingest_failures` table reflects exactly the 129 unreadable files the architecture doc cites. The frontend route surface is minimal and entirely backed. The 2026-05-01 worker patches (saga-size limit, wedge detection, sudoers, no `WatchdogSec` without `sd_notify`) are visible and correct in code. The NREM exclusion fix is in place and the manual run on 2026-05-02 21:34 UTC produced a real dream.
|
||||
|
||||
### Where I am uncertain
|
||||
|
||||
- I did not read the watcher-cron.log, sudo configuration, or systemd journal directly. The "sudo for `aaronai-watcher` restart" question (Phase 5 / divergence #10) is based on the session brief's stated sudoers contents only.
|
||||
- I did not exhaustively read each of the 32 experimental scripts. I read enough of each (header docstring) to classify; deep behavioral inspection of these is unnecessary for Track 1 but means I cannot rule out additional NREM-shape divergences inside them.
|
||||
- I did not deep-read frontend components (`~/aaronai-web/components/`). Per Phase 6 scope.
|
||||
- The session brief says Stage 3 is "stopped manually." I confirmed `systemctl is-active aaronai-stage3.service = inactive`. I did not confirm via `journalctl` when it was stopped — but the inventory doesn't need that, only the current state.
|
||||
|
||||
|
||||
|
||||
---
|
||||
|
||||
## Updates — 2026-05-03 session
|
||||
|
||||
*Layered updates from Track 1 improvement work on 2026-05-03. The 2026-05-02 inventory above is preserved as a point-in-time snapshot; corrections and resolutions are recorded here with provenance.*
|
||||
|
||||
### Resolved
|
||||
|
||||
- **NREM-shape divergence #1 (cumulative cross-night exclusion 500-cap, `dream.py`): RESOLVED.** Replaced cumulative `retrieved_sources` with session-scoped novelty. Early REM now excludes only NREM high-scorers from the current session; Late REM excludes the current session's NREM ∪ Early REM. Legacy `retrieved_sources` key cleared from `dreamer_state.json`. Verification: post-fix dream-manifest source count rose to 24 (vs. 13 / 16 on the two prior comparable runs); the previously hidden ~40% of the corpus is now reachable by Early/Late REM, as the architecture and reframe specify. NREM exclusion fix from 2026-05-02 preserved.
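
The session-scoped rule described above can be sketched as plain set operations. The function and parameter names are hypothetical, and reading "NREM ∪ Early REM" as the union of the sources each stage retrieved in the current session is an assumption; the actual `dream.py` change may differ in detail.

```python
def session_exclusions(nrem_high_scorers, nrem_sources, early_rem_sources):
    """Return the per-stage exclusion sets for one dream session.

    Nothing here is persisted across nights: every set is rebuilt from the
    current session's retrievals only.
    """
    return {
        "nrem": set(),                           # NREM replays freely
        "early_rem": set(nrem_high_scorers),     # skip only tonight's NREM high-scorers
        "late_rem": set(nrem_sources) | set(early_rem_sources),
    }


exclusions = session_exclusions(
    nrem_high_scorers={"a.md"},
    nrem_sources={"a.md", "b.md"},
    early_rem_sources={"c.md"},
)
print(sorted(exclusions["late_rem"]))
```

Because the sets are derived from the session's own inputs, no source can remain hidden for longer than one night.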
|
||||
|
||||
### Corrections to existing findings
|
||||
|
||||
- **`stage2_metadata` location (Phase 1, `stage2_worker.py`):** the metadata column lives on `stage_3_queue.stage2_metadata` (jsonb), **not on `stage_2_queue`**. `stage_2_queue` has only basic queue fields (`id, source, full_text, char_length, timestamps, failure_reason, attempts`). The 2026-05-02 entry implied otherwise. Corrected via direct schema inspection on 2026-05-03.
|
||||
|
||||
- **Stage 2 char_length gate (Phase 1, `stage2_worker.py`):** the `char_length < 2000` check at line 139 runs *before* the Mistral call at line 149. For sub-2000-char docs, Mistral is **never invoked** — the worker logs `Processing → Skipping Stage 3 → completed_at = NOW()` with no Mistral pass between them. The earlier framing of "documents under 2000 chars skip Stage 3" was correct as written, but the implied "Stage 2 produces orientation metadata for everything" architecture commitment is not what the code does. 339 of 1,041 completed Stage 2 docs (33%) have **no frame data extracted at all**, not "frame data extracted then discarded."
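
The ordering described in the char_length correction can be sketched as follows. This is an illustration of the control flow only: `process_row`, `fake_mistral`, and the returned dictionary are hypothetical, and only the 2000-character threshold and the 339 / 1,041 figures come from the finding.

```python
MIN_STAGE3_CHARS = 2000  # threshold from the finding above


def process_row(row, call_mistral):
    """Gate on length first; the model is only called for longer documents."""
    if row["char_length"] < MIN_STAGE3_CHARS:
        # Short document: marked complete with no frame data at all.
        return {"completed": True, "stage2_metadata": None, "enqueue_stage3": False}
    metadata = call_mistral(row["full_text"])
    return {"completed": True, "stage2_metadata": metadata, "enqueue_stage3": True}


calls = []


def fake_mistral(text):
    """Stand-in for the Mistral call; records that it was invoked."""
    calls.append(len(text))
    return {"active_frames": ["example"]}


short = process_row({"char_length": 1999, "full_text": "x" * 1999}, fake_mistral)
long_doc = process_row({"char_length": 2000, "full_text": "x" * 2000}, fake_mistral)

# Share of completed Stage 2 docs with no frame data, per the counts above.
share_without_frames = round(339 / 1041 * 100)
print(calls, short["stage2_metadata"], share_without_frames)
```

The stub is called once, for the 2000-character document only, which matches the "never invoked" observation for sub-2000-char docs.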
|
||||
|
||||
### New findings from 2026-05-03 frame analysis (Improvement #3)
|
||||
|
||||
- **`ingest_conversations.py` bypasses Stage 2 entirely.** 198 distinct conversation sources (`Claude:`, `ChatGPT:`, `Aaron AI:`, plus `type='aaronai_conversation'`) write directly to pgvector `embeddings` and never enter `stage_2_queue`. Conversations have **zero frame coverage by design**, not by accident. Combined with the 339-doc char-gate exclusion and 12 Stage 2 failures, **only 56% of the embeddings corpus has any frame data**. Same NREM shape — a routing decision the architecture didn't explicitly request, doing something silently that the architecture's "Stage 2 produces orientation for everything" commitment denies.
|
||||
|
||||
- **Voice notes (14) and dream outputs (39) are systematically excluded from the frame system.** Within the 339-doc <2000-char gap: all 14 voice notes and all 39 dreamer-output files (NREM, Early REM, Late REM, synthesis markdown) are present. Voice is one of Aaron's primary capture channels. Dream outputs are the dreamer's own reflection. Both are silent to the frame system that orients downstream extraction — meaning the dreamer cannot frame-condition on its own output. Same NREM shape as the others.
|
||||
|
||||
- **File-type × frame stratification signal exists and is currently unused** (cross-link to Phase 3 `embeddings.type` finding). The 2026-05-03 frame analysis (`docs/stage2-frame-analysis-2026-05-03.md` §5) shows that within frame-extracted docs, "Programming" pivots to pptx (n=15), "Application" pivots to pdf (n=13), Education spreads across pdf+docx — file type adds discriminating signal to frame routing. Currently `embeddings.type` is NULL for 71% of rows; backfilling it (Improvement #2, not yet applied) would make this stratification queryable at retrieval time instead of reverse-engineerable from filenames.
|
||||
|
||||
### Artifacts produced 2026-05-03
|
||||
|
||||
- **Code change:** `scripts/dream.py` (Improvement #1).
|
||||
- **New SQL view:** `stage2_frames_v` (over `stage_3_queue.stage2_metadata`; `CREATE OR REPLACE`, idempotent, drop with `DROP VIEW stage2_frames_v;`).
|
||||
- **New analysis script:** `scripts/experiments/frame_distribution_report.py` (read-only).
|
||||
- **JSON sidecar:** `experiments/frame_distribution_2026-05-03.json`.
|
||||
- **Report:** `docs/stage2-frame-analysis-2026-05-03.md`.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 — Scripts
|
||||
|
||||
Inventory of every file under `~/aaronai/scripts/` (and `~/aaronai/scripts/experiments/`). `.bak*` files are listed at the bottom of the section but not individually documented; they are point-in-time snapshots from the rollback work and are not part of any active code path.
|
||||
|
||||
### `api.py`
|
||||
- **Path:** `scripts/api.py`
|
||||
- **Status:** Working
|
||||
- **Last-touched:** 2026-05-01
|
||||
- **What it does:** FastAPI backend on port 8000. Hosts the chat endpoint (`/api/chat`), session-based auth (`/auth/login`, `/auth/logout`, `/auth/check`), conversation CRUD, settings panel API, memory editor, status endpoint, audio transcription via faster-whisper `large-v3`, capture endpoint (voice and image+voice), dreamer-status and dreamer-run, corpus-integrity status / retry / reconcile, and SSE streams for both authenticated dreamer notifications and the public capture page. Embeds an APScheduler `BackgroundScheduler` that drives the nightly dream cycle and conversation ingest. Loads SentenceTransformers `all-MiniLM-L6-v2` and the Anthropic SDK at startup. Auth is a session token in a 30-day cookie backed by `sessions.db` (sqlite). Conversations and messages are in `conversations.db` (sqlite). Document retrieval is pure cosine similarity over pgvector (top-8, threshold 0.3) — the CV-pinning workaround was stripped 2026-04-30.
|
||||
- **Dependencies:** `.env` (`PG_DSN`, `ANTHROPIC_API_KEY`, `AARON_AI_PASSWORD`, `NEXTCLOUD_*`); `~/aaronai/conversations.db`, `~/aaronai/sessions.db`, `~/aaronai/memory.md`, `~/aaronai/settings.json`, `~/aaronai/watcher_status.json`, `~/aaronai/watcher_state.json`, `~/aaronai/dreamer_state.json`, `~/aaronai/corpus_integrity_report.json`; PostgreSQL (`embeddings`, `stage_2_queue`, `ingest_failures`); SentenceTransformer model files; faster-whisper model files; the `dream.py`, `ingest.py`, and `corpus_integrity.py` scripts which it shells out to; Nextcloud WebDAV. Runs as `aaronai.service`.
|
||||
- **What depends on it:** Frontend (`aaronai-web` Next.js) consumes every `/api/*` endpoint; mobile capture layer consumes `/api/capture` and `/api/captures/events`; `dream.py` POSTs to `/api/events/notify` to push SSE to the frontend; the APScheduler embedded in this process is the only thing that triggers the nightly dream cycle and the nightly conversation ingest in production.
|
||||
- **Behavior matches intent?** Partial. Pure-similarity retrieval matches the post-2026-04-30 architecture statement. The `chat` function ignores `client_time` for memory retrieval purposes (just inserts it into the prompt), which is consistent with the doc. Two divergences worth flagging:
|
||||
1. `/auth/check` references `SESSIONS` (line 385), which is undefined: no `SESSIONS` set or dict exists anywhere in the file, so the reference is stale. Frontend auth checking evidently relies on the cookie being present rather than on this endpoint working; a request to it would raise `NameError` and return a 500. This is likely a leftover from an earlier in-memory session implementation that was migrated to sqlite without removing the check.
|
||||
2. `transcribe_and_save()` (the background voice capture path, line 670) does NOT save the raw audio file to `Journal/Media/` — only the transcript markdown to `Journal/Captures/`. The architecture doc's "Multimedia Ingest Pipeline" describes `Journal/Media/YYYY-MM/` as the raw-ground-truth location for all captured media. The image+voice path does write image bytes to Media, but voice-only does not. A future Late REM "raw images during synthesis" feature listed as "not yet built" in the architecture doc relies on Media existing, but for voice this means the audio is gone after transcription. Flagged.
|
||||
- **Notes:** APScheduler is created at module import (`scheduler = BackgroundScheduler()` at line 1105) and started in the lifespan. Stage 3 worker code is not invoked from here. The `/api/reindex` endpoint shells out to `ingest.py`, which still writes to pgvector and (since `SKIP_STAGE2_ENQUEUE` is unset by default) re-enqueues to `stage_2_queue`. A reindex can therefore put files back through Stage 2 and Stage 3, which under the bespoke decision is no longer the desired path. The retry endpoint at `/api/corpus/retry` writes `text[:50000]` to `stage_2_queue` (line 1074), reintroducing the 50KB truncation pattern that F14 fixed elsewhere. **NREM-shape divergence: the truncation cap was removed from `watcher.py`, `ingest.py`, and `corpus_integrity.py` per the F14 fix on 2026-05-01, but the `api.py` retry path was not patched.**
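
The effect of the unpatched retry path can be shown with a small sketch. The helper names are hypothetical; only the `text[:50000]` slice comes from the note above. Note that a Python string slice counts characters, not bytes, so "50KB" is an approximation for non-ASCII text.

```python
RETRY_LIMIT = 50_000  # the slice bound used by the retry path, per the note above


def retry_payload(text):
    """What the retry path would enqueue: a silently truncated copy."""
    return text[:RETRY_LIMIT]


def patched_payload(text):
    """What the F14-patched paths enqueue: the full text."""
    return text


document = "word " * 20_000  # 100,000 characters
lost = len(patched_payload(document)) - len(retry_payload(document))
print(len(retry_payload(document)), lost)
```

No exception or log line marks the difference; the only trace is the shorter `full_text` stored in the queue row.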
|
||||
|
||||
### `dream.py`
|
||||
- **Path:** `scripts/dream.py`
|
||||
- **Status:** Working (post NREM-fix)
|
||||
- **Last-touched:** 2026-05-02
|
||||
- **What it does:** The Active Inference engine. Provides the nightly pipeline (NREM → Early REM → Late REM → Synthesis) and a single-mode CLI entry-point. Each stage retrieves chunks from pgvector (or Graphiti when `DREAMER_SUBSTRATE=graphiti`), prompts Claude Sonnet, writes a markdown file to Nextcloud `Journal/Dreams/` via WebDAV, and feeds its output as context into the next stage. Pipeline writes a per-night manifest JSON. Lucid mode is the on-demand path used by Settings → Dream Now. State persisted in `~/aaronai/dreamer_state.json`; cumulative `retrieved_sources` capped at 500, trimmed to 400 on overflow. Score-band Early-REM exclusion (v1.1) preserved. The 2026-05-02 NREM exclusion fix is at line 478: `nrem_chunks = retrieve("nrem", excluded_sources=None)`.
|
||||
- **Dependencies:** `.env` (`PG_DSN`, `ANTHROPIC_API_KEY`, `NEXTCLOUD_*`); `pgvector` `embeddings` table (or graphiti sidecar `/search`); SentenceTransformer `all-MiniLM-L6-v2` (re-loaded inside `retrieve()`); `~/aaronai/dreamer_state.json`, `~/aaronai/watcher_state.json`, `~/aaronai/conversations.db`; Anthropic API; Nextcloud WebDAV; for SSE notify, the running `api.py` on `localhost:8000`.
|
||||
- **What depends on it:** APScheduler in `api.py` shells out to it nightly; `/api/dreamer/run` shells out for on-demand runs; `aaronai-dreamer.service` (Type=oneshot) wraps it for manual invocation; `e3_dreamer_substrate.py` invokes it under `DREAMER_SUBSTRATE=graphiti`.
|
||||
- **Behavior matches intent?** Yes for NREM (post-fix matches reframe's replay-and-consolidation framing); yes for Early REM and Late REM (still consult `previously_retrieved`, which the reframe permits as novelty bias); partial for Synthesis (no substrate mutation, which is fine under the architecture doc but is exactly what the reframe says is missing for E6 to work); "lucid" is implemented even though architecture doc lists Lucid mode as "not yet built" (the function exists and is reachable from the CLI/API).
|
||||
- **Notes:** `retrieve_graphiti()` accepts and applies `excluded_sources` (the F1 fix), but the over-fetch is `n_results * 3` and the post-filter is in-process. Dreamer falls back gracefully to empty when sidecar fails. **NREM-shape divergence candidate: the dreamer's exclusion-set state is *cumulative across all nights*, capped at 500 — every Early REM and Late REM excludes up to 500 prior sources. On a corpus of 1,200 sources this is ~40% of the corpus permanently invisible to Early/Late REM after the cap fills. The architecture doc and reframe don't specify cumulative-across-nights exclusion; they specify session-scoped novelty. The bug shape is the same as the NREM exclusion bug — a deduplication mechanism functioning silently in a way the architecture didn't request.** Flagged.
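
The cap-and-trim behavior flagged above can be simulated in a few lines. This is a sketch under stated assumptions: `update_retrieved` and `simulate` are hypothetical, the "keep the most recent 400" trim policy is assumed (the inventory does not say which entries survive a trim), and the 20-new-sources-per-night rate is invented for illustration.

```python
CAP = 500      # cap described in the Notes above
TRIM_TO = 400  # size after an overflow trim


def update_retrieved(previous, new_sources):
    """Append unseen sources; on overflow keep the most recent TRIM_TO (assumed policy)."""
    seen = set(previous)
    merged = list(previous)
    for source in new_sources:
        if source not in seen:
            seen.add(source)
            merged.append(source)
    if len(merged) > CAP:
        merged = merged[-TRIM_TO:]
    return merged


def simulate(nights, per_night=20):
    """Track the exclusion-list size night by night."""
    state, sizes = [], []
    for night in range(nights):
        fresh = [f"night{night}-src{i}" for i in range(per_night)]
        state = update_retrieved(state, fresh)
        sizes.append(len(state))
    return sizes


sizes = simulate(nights=40)
corpus = 1200
print(max(sizes), min(sizes[25:]))
print(round(CAP / corpus, 3), round(TRIM_TO / corpus, 3))
```

Once the cap fills, the list oscillates between 400 and 500 entries, so roughly a third to two fifths of a 1,200-source corpus stays excluded on any given night.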
|
||||
|
||||
### `watcher.py`
|
||||
- **Path:** `scripts/watcher.py`
|
||||
- **Status:** Working
|
||||
- **Last-touched:** 2026-05-01
|
||||
- **What it does:** Stage 1 of the encoding pipeline. Watches `/home/aaron/nextcloud/data/data/aaron/files` recursively via watchdog. Loads SentenceTransformer `all-MiniLM-L6-v2` once at startup. On modify/create/move/close events, debounces 120s, then chunks (500-word with 50-word overlap), embeds, and writes to pgvector `embeddings`. Enqueues full text to `stage_2_queue` unless `SKIP_STAGE2_ENQUEUE` is set. Records extraction or pgvector failures to `ingest_failures` and resolves them on success. Heartbeat written every loop tick to `~/aaronai/watcher_heartbeat`. Status JSON written to `~/aaronai/watcher_status.json`. Startup recovery scans for files with changed mtimes since last run. `on_moved` checks `dest_path` (Nextcloud writes `.part` then renames); `on_closed` is handled as a belt-and-suspenders measure.
|
||||
- **Dependencies:** `.env` (`PG_DSN`); pgvector; SentenceTransformer; `pypdf`, `python-docx`, `python-pptx`; watchdog; `~/aaronai/watcher_state.json`. Runs as `aaronai-watcher.service`.
|
||||
- **What depends on it:** Anything that reads from pgvector `embeddings` (api.py chat, dream.py retrieval, tier1_migration.py); anything that polls `stage_2_queue` (stage2_worker); `corpus_integrity.py`; the watcher heartbeat is consumed by an external cron monitor mentioned in tech-debt.
|
||||
- **Behavior matches intent?** Yes against the architecture's Stage 1 description and the parity principle (no filtering, no decisions). The full-text path no longer truncates to 50KB. Under the bespoke decision the Stage 2 enqueue path is on the chopping block; it is currently still active and runs by default.
|
||||
- **Notes:** No truncation in `enqueue_stage2()`. `Admin/Backups` and `Journal/Media/` are excluded from indexing per the architecture's File Management Policy. `SKIP_STAGE2_ENQUEUE` env var is the documented kill-switch for migration runs.
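
The "500-word with 50-word overlap" chunking mentioned above can be sketched like this. `chunk_words` is a hypothetical helper, not the function in `watcher.py`; splitting on whitespace and the handling of the final partial chunk are assumptions.

```python
def chunk_words(text, size=500, overlap=50):
    """Split text into word windows of `size`, each sharing `overlap` words with the previous one."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window already reached the end of the document
    return chunks


sample = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(sample)
print(len(chunks))
```

With these parameters each new chunk starts 450 words after the previous one, so a 1,000-word document produces three chunks, the last one shorter than 500 words.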
|
||||
|
||||
### `ingest.py`
|
||||
- **Path:** `scripts/ingest.py`
|
||||
- **Status:** Working-degraded (functional but architecturally redundant)
|
||||
- **Last-touched:** 2026-05-01
|
||||
- **What it does:** Bulk folder ingester. Loads SentenceTransformer at module import, walks a folder, extracts text, chunks, embeds, writes to `embeddings`, and (unless `SKIP_STAGE2_ENQUEUE`) enqueues to `stage_2_queue`. Invoked by `api.py`'s `/api/reindex` endpoint with `NEXTCLOUD_PATH` as argument. CLI default target is `~/aaronai/docs`.
|
||||
- **Dependencies:** Same as `watcher.py` minus watchdog. `.env`, pgvector, SentenceTransformer. No service unit — invoked on demand only.
|
||||
- **What depends on it:** `api.py` `/api/reindex` button; the architecture's tech-debt entry mentions `ingest_chatgpt.py` and `ingest_claude.py` (manual one-shot scripts) but neither of those files is present in `scripts/` — so the only live caller is `/api/reindex`.
|
||||
- **Behavior matches intent?** Partial. The architecture doc has it as one of four ingest scripts in the Layer 1 table. Only this file and `ingest_conversations.py` exist. The chunk-embed-store flow still matches Stage 1 intent. The Stage 2 enqueue side effect (running every reindex) is a wide blast radius — clicking "Re-index" puts every changed file back through cascade, which under the bespoke decision is increasingly unwanted work.
|
||||
- **Notes:** Almost the entire chunk/embed/extract code path is duplicated verbatim from `watcher.py`. The architecture's tech-debt entry F11 (two implementations of encoding pipeline) is real and visible side by side. Both scripts call their own `enqueue_stage2()` defined inline; both load SentenceTransformer at import (the model is loaded twice if both are imported in the same process, which only happens during unusual import patterns).
|
||||
|
||||
### `stage2_worker.py`
|
||||
- **Path:** `scripts/stage2_worker.py`
|
||||
- **Status:** Working
|
||||
- **Last-touched:** 2026-05-01
|
||||
- **What it does:** Polls `stage_2_queue` for rows with no `completed_at`/`failed_at` and `attempts < 3`. Sends document to local Mistral (`mistral:latest` via Ollama on port 11434) with a taxonomy-free prompt that returns four fields: `active_frames`, `frame_relationships`, `extraction_orientation`, `one_sentence_summary`. Documents under 2000 chars skip Stage 3 and are marked complete. Otherwise builds an orientation string and enqueues `stage_3_queue` with `(source, full_text, orientation, stage2_metadata)`. Wedge recovery: two or more consecutive failures trigger `sudo systemctl restart ollama`. Logs to `/var/log/aaronai/stage2.log`. Heartbeat at `/var/log/aaronai/stage2-heartbeat`. Worker version 2.1.
|
||||
- **Dependencies:** `.env` (`PG_DSN`); Ollama on `localhost:11434`; `mistral:latest` model loaded; passwordless sudo for `/bin/systemctl restart ollama` (per `/etc/sudoers.d/aaron-aaronai`); PostgreSQL `stage_2_queue` and `stage_3_queue` tables. Runs as `aaronai-stage2.service`.
|
||||
- **What depends on it:** Anything that reads `stage_3_queue.completed_at` (corpus_integrity, api.py corpus status); Stage 3 worker as the queue consumer.
|
||||
- **Behavior matches intent?** Partial under the reframe. The taxonomy-free prompt matches the Stage 3.1 research direction the architecture doc described. Under the bespoke decision the entire Stage 2/3 pipeline is being re-evaluated; the worker itself is doing what it was redesigned to do.
|
||||
- **Notes:** `recover_wedge()` calls absolute `/usr/bin/sudo` and `/bin/systemctl` paths (per the v2.1 patch). No `WatchdogSec`-driven SIGKILL pattern (commented out in the systemd unit per the 2026-05-01 fix). Mistral parse-failure is detected and surfaces as `failure_reason='mistral_parse_failure'`. `RETRY_ATTEMPTS = 2` plus the original attempt = 3 max attempts before the row is dead; this matches the worker's SQL `attempts < %s` with `RETRY_ATTEMPTS + 1`.
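
The retry arithmetic in the note above reduces to a simple eligibility predicate. This is a sketch: `is_eligible` and the row dictionaries are hypothetical, while `RETRY_ATTEMPTS = 2` and the `attempts < RETRY_ATTEMPTS + 1` comparison come from the note.

```python
RETRY_ATTEMPTS = 2                 # retries after the original attempt
MAX_ATTEMPTS = RETRY_ATTEMPTS + 1  # value bound to `attempts < %s` in the poll query


def is_eligible(row):
    """Mirror of the polling filter: unfinished, not failed, and under the attempt limit."""
    return (
        row["completed_at"] is None
        and row["failed_at"] is None
        and row["attempts"] < MAX_ATTEMPTS
    )


rows = [
    {"id": 1, "completed_at": None, "failed_at": None, "attempts": 0},
    {"id": 2, "completed_at": None, "failed_at": None, "attempts": 2},
    {"id": 3, "completed_at": None, "failed_at": None, "attempts": 3},
]
print([row["id"] for row in rows if is_eligible(row)])
```

A row that has already been attempted three times is no longer picked up, which is the "dead row" state the note describes.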
|
||||
|
||||
### `stage3_worker.py`
|
||||
- **Path:** `scripts/stage3_worker.py`
|
||||
- **Status:** Stopped (per session brief — service stopped manually 2026-05-02; code is unchanged)
|
||||
- **Last-touched:** 2026-05-01
|
||||
- **What it does:** Polls `stage_3_queue` for rows ready to process. For each, chunks document at 500-word boundaries (matching Stage 1), and POSTs to graphiti sidecar `/episodes/bulk`. Three paths by document size: (a) <1500 chars → single episode, no saga; (b) ≤10 chunks → single bulk commit with a saga tag; (c) >10 chunks → split into batches of 10 each, all tagged with the same saga so graphiti links them as one document unit. Wedge recovery: two or more consecutive failures trigger `sudo systemctl restart aaronai-graphiti.service`, after which the worker waits 45s for sentence-transformers + BGE reranker + graphiti to re-init. Worker version 2.2.
|
||||
- **Dependencies:** `.env` (`PG_DSN`); graphiti sidecar on `localhost:8001`; passwordless sudo for `/bin/systemctl restart aaronai-graphiti.service`; PostgreSQL `stage_3_queue`. Runs as `aaronai-stage3.service`.
|
||||
- **What depends on it:** `corpus_integrity.py` reads `stage_3_queue.completed_at` to compute "Graphiti-side" coverage; `api.py`'s `/api/corpus/status` does the same.
|
||||
- **Behavior matches intent?** No, against the bespoke decision. The architecture doc describes Stage 3 as the cascade ingest path into graphiti; the bespoke decision dissolves that path. The code itself does what it was patched to do (saga splitting, wedge detection, sudoers). What it represents — feeding documents into a graphiti substrate — is no longer the architectural target.
|
||||
- **Notes:** Service is stopped per the session brief, but `stage_3_queue` rows continue to be created by `stage2_worker.py`, so the queue grows monotonically while the consumer is off. This is fine for the rollback baseline (no new rows of consequence with cascade prompts in the rolled-back form), but is worth flagging in case the watcher picks up new files. Uses the absolute `/usr/bin/sudo` and `/bin/systemctl` paths (v2.2 patch). `start` and `end` chunk indices are 1-based in the saga-batch logging — cosmetic only.
|
||||
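The three size-based paths described for `stage3_worker.py` can be sketched roughly as below. This is not the worker's code: the function names, constant names, and the shape of the returned dictionaries are hypothetical, and the sketch only plans batches rather than POSTing them. The 1500-char threshold, 500-word chunks, and batches of 10 are taken from the description above.

```python
from typing import Dict, List

SINGLE_EPISODE_MAX_CHARS = 1500  # below this: one episode, no saga
CHUNK_WORDS = 500                # chunk size, matching Stage 1
BATCH_CHUNKS = 10                # chunks per bulk request


def chunk_words(text: str, size: int = CHUNK_WORDS) -> List[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def plan_batches(text: str, saga: str) -> List[Dict[str, object]]:
    """Return the bulk requests the three paths would produce (hypothetical shape)."""
    if len(text) < SINGLE_EPISODE_MAX_CHARS:
        # Path (a): short document, single episode, no saga tag.
        return [{"saga": None, "chunks": [text]}]
    chunks = chunk_words(text)
    # Paths (b) and (c): slicing by BATCH_CHUNKS yields one batch when there
    # are at most 10 chunks and several batches sharing one saga otherwise.
    return [
        {"saga": saga, "chunks": chunks[i:i + BATCH_CHUNKS]}
        for i in range(0, len(chunks), BATCH_CHUNKS)
    ]
```

In this sketch paths (b) and (c) fall out of the same slice, which is one way to keep the saga tag identical across split batches.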
|
||||
### `graphiti_service.py`
|
||||
- **Path:** `scripts/graphiti_service.py`
|
||||
- **Status:** Working (per the session brief; will be deprecated when bespoke substrate replaces graphiti)
|
||||
- **Last-touched:** 2026-04-30 (commit), 2026-05-02 (working-copy mtime — same content, file was rewritten then reset during rollback)
|
||||
- **What it does:** FastAPI sidecar on port 8001. Wraps `graphiti-core` to avoid asyncio event loop conflicts in the main FastAPI process. Single graphiti instance built in lifespan, closed on shutdown. Endpoints: `/health`, `POST /episodes` (single), `POST /episodes/bulk` (with optional `saga` link), `GET /search`. Uses `SentenceTransformerEmbedder` from `st_embedder.py` and `BGERerankerClient` from graphiti-core. `FalkorDriver` connects to FalkorDB at `localhost:6379` database `aaron`. LLM provider switchable via env (`anthropic` default → `claude-sonnet-4-6`). `max_coroutines=2`, `EMBEDDING_DIM=384`. Hard-coded group default `aaron`.
|
||||
- **Dependencies:** `.env` (`ANTHROPIC_API_KEY` or `LLM_API_KEY`, `LLM_PROVIDER`, `LLM_MODEL`, `FALKORDB_HOST`, `FALKORDB_PORT`, `GRAPHITI_GROUP_ID`); FalkorDB Docker container on `127.0.0.1:6379`; graphiti-core 0.29.0 in venv; sentence-transformers, BGE reranker. Runs as `aaronai-graphiti.service`.
|
||||
- **What depends on it:** `dream.py` `retrieve_graphiti()` (only when `DREAMER_SUBSTRATE=graphiti`); `stage3_worker.py` posts to it; `tier1_migration.py` posts to it; the bulk cost-test scripts post to it; `e3_dreamer_substrate.py` queries it; `e1_8_taxfree_cascade.py` and `e1_9_retroactive.py` post or query.
|
||||
- **Behavior matches intent?** Yes against the architecture doc. Under the bespoke decision this whole sidecar is the layer being replaced; the doc still says it's the target memory layer.
|
||||
- **Notes:** `add_episode_bulk()` is called with `saga=req.saga or None` — the saga param is what stage3_worker uses to link split-batch chunks. Result body returns `{"ok": true, "count": N}` rather than the underlying graphiti return value. Logs full traceback to `/var/log/aaronai/graphiti-sidecar.log` (the 2026-04-30 fix).
|
||||
|
||||
### `corpus_integrity.py`
|
||||
- **Path:** `scripts/corpus_integrity.py`
|
||||
- **Status:** Working
|
||||
- **Last-touched:** 2026-05-01
|
||||
- **What it does:** Three-way reconciliation. Compares filesystem (Nextcloud), pgvector (`embeddings.source`), and graphiti (`tier1_migration_state.json` ingested list ∪ `stage_3_queue.completed_at IS NOT NULL` source list). Reports counts in each set, and gaps (in filesystem but neither pgvector nor graphiti). With `--fix`, attempts text extraction on each gap file and either enqueues to `stage_2_queue` (full text, no truncation) or writes to `ingest_failures` if extraction returns empty. Writes `~/aaronai/corpus_integrity_report.json`.
|
||||
- **Dependencies:** `.env`; pgvector `embeddings`, `stage_3_queue`, `ingest_failures`, `stage_2_queue`; `~/aaronai/experiments/tier1_migration_state.json`; pypdf, python-docx, python-pptx. No service unit — invoked by `api.py /api/corpus/reconcile` background task and by the user manually.
|
||||
- **What depends on it:** `api.py /api/corpus/status` reads the report it writes; the SettingsPanel UI's "Ingest Health" section consumes that.
|
||||
- **Behavior matches intent?** Partial. Implements the architecture's "ingest_failures + reconciliation" tech-debt-resolved item correctly. Under the bespoke decision, the graphiti side of the reconciliation is meaningless after Stage 3 is shut off — the script will keep happily reporting "this many sources are in graphiti" but those numbers won't move and won't represent useful state. Not broken, but the report's "graphiti only" / "Both" lines become semantically empty.
|
||||
- **Notes:** Re-implements `extract_text` for retry path inline rather than reusing watcher's; another instance of F11.
|
||||
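The three-way reconciliation in `corpus_integrity.py` is essentially set arithmetic over source identifiers. The sketch below is an illustration under assumptions: the function name, the bucket names, and the exact definitions of "both" and the "only" buckets are guesses at what the report means. The graphiti set being the union of the Tier 1 ingested list and completed Stage 3 rows, and gaps being files in neither store, follow the description above.

```python
from typing import Dict, Iterable, Set


def reconcile(
    filesystem: Iterable[str],
    pgvector: Iterable[str],
    tier1_ingested: Iterable[str],
    stage3_completed: Iterable[str],
) -> Dict[str, Set[str]]:
    """Bucket sources by which stores know about them (hypothetical bucket names)."""
    fs = set(filesystem)
    pg = set(pgvector)
    # "Graphiti-side" coverage: Tier 1 ingested list plus completed Stage 3 rows.
    graphiti = set(tier1_ingested) | set(stage3_completed)
    return {
        "both": pg & graphiti,
        "pgvector_only": pg - graphiti,
        "graphiti_only": graphiti - pg,
        # Gaps: on disk but in neither pgvector nor graphiti.
        "gaps": fs - (pg | graphiti),
    }
```

Once Stage 3 is off, `stage3_completed` stops growing, so in this model only `pgvector_only` and `gaps` can still change.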
|
||||
### `ingest_conversations.py`
|
||||
- **Path:** `scripts/ingest_conversations.py`
|
||||
- **Status:** Working
|
||||
- **Last-touched:** 2026-04-27
|
||||
- **What it does:** Nightly job. Reads `conversations.db`, finds conversations with ≥3 user-assistant exchanges, slides a 2-exchange window, formats `[Aaron AI conversation: title]` chunks, embeds with SentenceTransformer, writes to pgvector `embeddings` with `id = aaronai_conv_{conv_id}_{idx}` and `type='aaronai_conversation'`. Idempotent via `ON CONFLICT DO UPDATE`.
|
||||
- **Dependencies:** `.env`; pgvector; `conversations.db`. Triggered by APScheduler in `api.py` at 02:30 UTC.
|
||||
- **What depends on it:** Anything reading from pgvector. Indirect: dream.py and chat retrieval pull these chunks.
|
||||
- **Behavior matches intent?** Yes. Matches the architecture's Layer 1 ingest table.
|
||||
- **Notes:** No watchdog/state — re-runs each night and skips already-embedded ids. `cur.close()` is missing on the read connection at line 39 (the conn is closed though, so it's harmless).
|
||||
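The windowing step in `ingest_conversations.py` can be sketched as follows. The minimum of 3 exchanges, the 2-exchange window, the `[Aaron AI conversation: title]` header, and the `aaronai_conv_{conv_id}_{idx}` id format come from the description above. The stride of 1, the `User:` / `Assistant:` body layout, and the function name are assumptions.

```python
from typing import List, Tuple

Exchange = Tuple[str, str]  # (user message, assistant reply)

MIN_EXCHANGES = 3  # conversations shorter than this are skipped
WINDOW = 2         # exchanges per chunk


def conversation_chunks(conv_id: int, title: str, exchanges: List[Exchange]) -> List[Tuple[str, str]]:
    """Return (id, text) pairs for a sliding window over the exchanges."""
    if len(exchanges) < MIN_EXCHANGES:
        return []
    chunks = []
    for idx in range(len(exchanges) - WINDOW + 1):  # assumed stride of 1
        window = exchanges[idx:idx + WINDOW]
        body = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in window)
        chunk_id = f"aaronai_conv_{conv_id}_{idx}"
        chunks.append((chunk_id, f"[Aaron AI conversation: {title}]\n{body}"))
    return chunks
```

Because the ids are deterministic, re-running the job produces the same keys, which is what makes an `ON CONFLICT DO UPDATE` upsert idempotent.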
|
||||
### `st_embedder.py`
|
||||
- **Path:** `scripts/st_embedder.py`
|
||||
- **Status:** Working
|
||||
- **Last-touched:** 2026-04-27
|
||||
- **What it does:** `EmbedderClient` adapter for graphiti-core. Wraps SentenceTransformer `all-MiniLM-L6-v2` (384-dim) so graphiti uses the same embedding model as Stage 1. No API cost for graphiti embeddings.
|
||||
- **Dependencies:** `graphiti_core.embedder.client`, sentence-transformers.
|
||||
- **What depends on it:** `graphiti_service.py` imports it at sidecar startup.
|
||||
- **Behavior matches intent?** Yes. Implements the "embedding layer stays on Sentence Transformers regardless of LLM" architectural commitment.
|
||||
- **Notes:** Will be obsolete when graphiti is replaced under the bespoke decision (the embedder pattern carries over but this specific adapter does not).
|
||||
|
||||
### `tier1_migration.py`
|
||||
- **Path:** `scripts/tier1_migration.py`
|
||||
- **Status:** Stable but unused (already-run one-shot)
|
||||
- **Last-touched:** 2026-04-30
|
||||
- **What it does:** Migrates ~300 most-recent pgvector sources to graphiti via the sidecar's `/episodes/bulk` endpoint. Resumable via `~/aaronai/experiments/tier1_migration_state.json`. Adapts batch size to document length (`BATCH_SIZE=4`, `LONG_DOC_BATCH_SIZE=2` for docs ≥5000 chars). Implements Max-pending-queries / timeout / rate-limit backoff. Writes per-batch results to `tier1_migration_results.json`.
|
||||
- **Dependencies:** `.env` (`PG_DSN`); graphiti sidecar; `~/aaronai/experiments/`. No service unit.
|
||||
- **What depends on it:** `corpus_integrity.py` reads the state file. `api.py` corpus status reads the same file. Both treat ingested-list as part of the "graphiti coverage" answer.
|
||||
- **Behavior matches intent?** Yes against the architecture's Tier 1 migration plan (already complete per the doc — 1,205 sources, 4,990 nodes, 22,289 edges). Obsolete under the bespoke decision but harmless if not run again.
|
||||
- **Notes:** Hard-codes `timestamp: "2026-04-28T00:00:00"` for migration episodes — all migrated sources land with that bi-temporal `valid_at`. The migration state file lives in `~/aaronai/experiments/`, which is referenced from multiple downstream readers — moving or deleting it would break corpus integrity status.
|
||||
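A minimal sketch of the resumable, length-adaptive batching described for `tier1_migration.py`. The constants `BATCH_SIZE=4`, `LONG_DOC_BATCH_SIZE=2`, and the 5000-char threshold are from the description above. The state-file key `"ingested"`, the document dictionary shape, and the rule "shrink the batch if any candidate is long" are assumptions, since the script's actual selection rule is not shown here.

```python
import json
from pathlib import Path
from typing import Dict, List, Set

BATCH_SIZE = 4
LONG_DOC_BATCH_SIZE = 2
LONG_DOC_CHARS = 5000


def load_ingested(state_path: Path) -> Set[str]:
    """Read already-migrated sources from the state file (assumed key: 'ingested')."""
    if not state_path.exists():
        return set()
    return set(json.loads(state_path.read_text()).get("ingested", []))


def next_batch(pending: List[Dict[str, str]], ingested: Set[str]) -> List[Dict[str, str]]:
    """Pick the next batch, skipping finished sources and shrinking for long docs."""
    todo = [doc for doc in pending if doc["source"] not in ingested]
    candidate = todo[:BATCH_SIZE]
    if any(len(doc["text"]) >= LONG_DOC_CHARS for doc in candidate):
        return candidate[:LONG_DOC_BATCH_SIZE]
    return candidate
```

Since downstream readers treat the state file as part of the graphiti coverage answer, a sketch like this only ever reads it; writing is left out on purpose.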
|
||||
### `consolidator_v0_1.py`
|
||||
- **Path:** `scripts/consolidator_v0_1.py`
|
||||
- **Status:** Deprecated (per reframe doc and bespoke decision)
|
||||
- **Last-touched:** 2026-04-29 (commit), 2026-04-30 (working tree)
|
||||
- **What it does:** Calibration-phase alias resolution. Pulls all `:Entity` nodes from FalkorDB `aaron` graph, computes summary embeddings via Ollama `nomic-embed-text`, infers light type labels heuristically, computes pairwise (name, ego, neighbor) similarity within type blocks, writes a markdown proposals log to `Nextcloud/Journal/Consolidation/proposals-{ts}.md` plus a JSON sibling. **Does not execute merges.** The 0.1.5 in-place patch (containment metric replacing Jaccard, summary embeddings) is reflected in this file; the `.bak` is the pre-patch version.
|
||||
- **Dependencies:** FalkorDB on port 6379 (direct, not via sidecar); Ollama for embeddings; `Nextcloud/Journal/Consolidation/`.
|
||||
- **What depends on it:** Nothing in production. Designed for human review of proposals.
|
||||
- **Behavior matches intent?** No, under the reframe and bespoke decision. The reframe doc explicitly identifies "consolidator-as-separate-system" as the architectural mistake — its function moves into the dream phase. Track 1 should consider this a removal candidate.
|
||||
- **Notes:** No service unit, no scheduler entry — executed manually only. Calibration findings (2026-04-29) showed alias-from-graph-features-alone has structural problems on this corpus.
|
||||
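To illustrate why a containment metric might replace Jaccard for alias candidates, here is a sketch comparing the two on neighbor sets. The containment definition used here (intersection over the smaller set, also known as the overlap coefficient) is an assumption; the consolidator's actual formula is not reproduced in this inventory.

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Intersection over union; penalizes pairs with very different set sizes."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a: Iterable[str], b: Iterable[str]) -> float:
    """Intersection over the smaller set (assumed definition)."""
    sa, sb = set(a), set(b)
    smaller = min(len(sa), len(sb))
    return len(sa & sb) / smaller if smaller else 0.0
```

For a sparsely connected alias whose two neighbors both appear among the six neighbors of the main entity, Jaccard gives 2/6 while this containment gives 1.0, which is the kind of pair a size-sensitive metric would miss.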
|
||||
### `backup.sh`
|
||||
- **Path:** `scripts/backup.sh`
|
||||
- **Status:** Working
|
||||
- **Last-touched:** 2026-04-26
|
||||
- **What it does:** Daily-snapshot bash script. Copies `memory.md`, `settings.json`, `conversations.db` into `~/nextcloud/.../Admin/Backups/` with date-stamped names; deletes anything older than 7 days. Output ends up inside Nextcloud's `Admin/Backups/`, which the watcher excludes from indexing — so backups don't pollute the corpus.
|
||||
- **Dependencies:** Read access to the three files; write access to `Admin/Backups/`.
|
||||
- **What depends on it:** Nothing programmatic. Operationally: the only off-host backup of `memory.md` and `settings.json`.
|
||||
- **Behavior matches intent?** Yes. Lightweight, no-judgement copy → Nextcloud → Nextcloud Desktop → off-machine.
|
||||
- **Notes:** Cron-driven (Phase 5 will confirm). Uses `find -mtime +7 -delete` so naming-format changes wouldn't break retention.
|
||||
|
||||
### Experimental scripts (one-shot research artifacts)
|
||||
|
||||
The following scripts are all completed experiments. None has a service unit, none is on a schedule, none is a runtime dependency of any production code path. They are kept as reproducibility artifacts for the experiments log. **All are candidates for moving out of `scripts/` into `experiments/` or `deprecated/`**: they crowd the production directory, and it is hard to tell at a glance which files are live workers.
|
||||
|
||||
| File | Experiment | Status | Notes |
|---|---|---|---|
| `audit_expansion_draw.py` | Type-aware stratified draw for n=20 audit expansion | Experimental | Sample-construction tool for `base_class_audit_rerun.py` |
| `base_class_test.py` | Base-class enrichment n=20 | Experimental | OOP framing experiment, validated 2026-04-28 |
| `base_class_validation.py` | Base-class enrichment n=50 | Experimental | Main validation study |
| `base_class_audit_rerun.py` | Base-class enrichment audit rerun | Experimental | n=8 paired-extraction audit, 0% fabrication |
| `briefing_generator_v2.py` | Experiment 002b (briefing v2) | Experimental | Validated local Mistral structural pattern recognition at 96% |
| `briefing_test.py` | Experiment 002 (briefing v1) | Experimental | Superseded by v2 |
| `cascade_test.py` | Entity-drafter cascade n=20 | Experimental | Falsified 2026-04-28 |
| `cascade_optimization_test.py` | Optimized entity-drafter cascade n=30 | Experimental | Confirmed entity-drafter cascade is dead |
| `consistency_test.py` | Mistral 3-pass consistency n=50 | Experimental | Experiment 001 |
| `consistency_test_v2.py` | Entity-only consistency, fixed sampling | Experimental | Experiment 003 |
| `cost_test_graphiti_bulk.py` | Bulk endpoint cost test | Experimental | Stratified n=50 |
| `cost_test_graphiti_bulk_retry.py` | Retry of failed bulk batches | Experimental | Pre-MAX_QUEUED_QUERIES bump |
| `cost_test_graphiti_bulk_retry2.py` | Second retry attempt | Experimental | Smaller batches, post-bump |
| `cost_test_graphiti_migration.py` | Single-episode migration cost test | Experimental | Stratified n=50 |
| `e1_select_sample.py` | E1 sample selection | Experimental | Cascade re-extraction sample |
| `e1_run_cascade.py` | E1 orchestration | Experimental | Initial cascade run, group `aaron_cascade_test` |
| `e1_run_cascade_corrected.py` | E1 corrected (custom_extraction_instructions path) | Experimental | Re-run with the fixed prompt-path |
| `e1_per_source_predicates.py` | E1 per-source predicate count | Experimental | Corrected metric |
| `e1_compare_metrics.py` | E1 A vs B metrics comparison | Experimental | Reads from FalkorDB via redis-cli docker exec |
| `e14_select_sample.py` | E1.4 sample selection (n=30) | Experimental | Stratified, excludes E1's 10 |
| `e14_run_cascade.py` | E1.4 cascade orchestration | Experimental | Group `aaron_cascade_e14` |
| `e14_per_source_predicates.py` | E1.4 per-source predicate diversity | Experimental | Bucket-level analysis |
| `e16_rate_purity.py` | E1.6 domain-purity human rating UI | Experimental | Surfaces taxonomic-mismatch finding |
| `e16_analyze.py` | E1.6 Spearman correlation against E1.4 | Experimental | Pre-registered decision rules |
| `e2_resolution_check.py` | E2 entity resolution diagnostic | Experimental | Six test entities, FalkorDB query |
| `e2_alias_followup.py` | E2 alias follow-up | Experimental | Aaron AI variants etc. |
| `e2_source_diversity.py` | E2 episode count per entity | Experimental | Diagnostic |
| `token_measurement_test.py` | Experiment 005: token reduction | Experimental | Validates 42.0% modeled estimate |
| `experiments/e1_8_eval.py` | E1.8 eval phase | Experimental | Pulls predicate counts |
| `experiments/e1_8_taxfree_cascade.py` | E1.8 ingest phase | Experimental | Taxonomy-free cascade |
| `experiments/e1_9_retroactive.py` | E1.9 retroactive validation | Experimental | Phase 1 parked 2026-04-30 (graph immature) |
| `experiments/e3_dreamer_substrate.py` | E3 dreamer substrate comparison | In-flight | "Genuinely ready" per architecture doc post-F1 fix; per bespoke decision now confounded, so not runnable to produce a trustworthy answer |
||||
|
||||
The `e3_dreamer_substrate.py` script is the only one with current relevance: its run was the proximate cause of the bespoke decision (per the decision doc, running E6 on graphiti is "a vibe check" because of issue #1325 and friends). Code is functional; under the bespoke decision the experiment it runs cannot produce a trustworthy answer.
|
||||
|
||||
### Backup files (`.bak*`)
|
||||
|
||||
The following are point-in-time copies left behind by the rollback work. None is on any code path. They are documented as a group rather than individually:
|
||||
|
||||
- `api.py.bak.20260501-001427`
|
||||
- `consolidator_v0_1.py.bak` (pre-0.1.5-patch)
|
||||
- `corpus_integrity.py.bak.20260501-021703`
|
||||
- `dream.py.bak`, `dream.py.bak.20260501-002209`
|
||||
- `graphiti_service.py.bak`, `graphiti_service.py.bak.20260501-185619`, `graphiti_service.py.bak.20260502-022307`
|
||||
- `ingest.py.bak.20260501-004131`
|
||||
- `stage2_worker.py.bak.20260501-171928`, `.20260501-172531`, `.20260501-185942`
|
||||
- `stage3_worker.py.bak.20260501-050354`, `.20260501-050453`, `.20260501-050719`, `.20260501-173233`, `.20260501-190357`
|
||||
- `watcher.py.bak`, `watcher.py.bak.20260501-004131`
|
||||
|
||||
Stage 3 alone has five `.bak` versions, which is consistent with the repeated patching across v2.0 → v2.1 → v2.2. Track 1 cleanup candidate: collapse all `.bak*` files into a `deprecated/` directory or remove them (git history is the durable artifact).
|
||||
|
||||
### `__pycache__/`
|
||||
|
||||
Compiled `.pyc` files for `api`, `corpus_integrity`, `dream`, `ingest`, `stage3_worker`, `st_embedder`, `watcher`. Notably there is *no* `.pyc` for `stage2_worker.py`. The likeliest explanation is that it is only ever executed as a top-level script: CPython writes a bytecode cache for imported modules, not for the script it runs as `__main__`. That is still a guess from absence, and it would not by itself explain why `stage3_worker` does have one; uncertain. Not a code path. Remove on next clean build if desired.
|
||||
|
||||
---
|
||||
|
||||
### Phase 1 summary
|
||||
|
||||
**Working and matching intent:**
|
||||
- `watcher.py` (Stage 1)
|
||||
- `ingest_conversations.py` (nightly conversation indexer)
|
||||
- `st_embedder.py`
|
||||
- `backup.sh`
|
||||
|
||||
**Working with behavior-vs-intent divergences:**
|
||||
- `api.py` — dead `/auth/check` reference; voice capture doesn't archive raw audio to `Journal/Media/`; `/api/corpus/retry` reintroduces 50KB truncation.
|
||||
- `dream.py` — cumulative 500-source exclusion across nights is a NREM-shape divergence: silently shrinks Early/Late REM's reachable corpus over time without architectural mandate. NREM exclusion fix is in place but the pattern that caused that bug exists at a different layer.
|
||||
- `ingest.py` — duplicates Stage 1 logic (F11), default behavior re-enqueues to Stage 2 on every reindex.
|
||||
- `stage2_worker.py` — works as designed; under the bespoke decision is doing work that's no longer the architectural target.
|
||||
- `corpus_integrity.py` — graphiti side of the report becomes semantically empty after Stage 3 shutoff.
|
||||
- `graphiti_service.py` — works as designed; same story as Stage 2 — not aligned with bespoke direction.
|
||||
|
||||
**Stopped / deprecated / experimental:**
|
||||
- `stage3_worker.py` — service stopped manually; code in repo, last-modified 2026-05-01.
|
||||
- `consolidator_v0_1.py` — reframe-deprecated.
|
||||
- `tier1_migration.py` — already-run one-shot, kept as reproducibility artifact.
|
||||
- All 32 experimental scripts in `scripts/` and `scripts/experiments/`.
|
||||
- `e3_dreamer_substrate.py` — in-flight per architecture doc, confounded per bespoke decision.
|
||||
|
||||
**Removal candidates (do not remove):**
|
||||
- All `.bak*` files (~20 of them) — git history covers them.
|
||||
- The 32 experimental scripts could move to `deprecated/` or `experiments/` to clean up `scripts/`.
|
||||
- `consolidator_v0_1.py` — explicitly deprecated by reframe.
|
||||
- `tier1_migration.py` — completed migration; kept for reproducibility.
|
||||
|
||||
**NREM-shaped divergences (the most important class of finding):**
|
||||
1. **`dream.py` cumulative exclusion 500-cap.** The `retrieved_sources` list grows across nights and is the exclusion set for Early REM and Late REM. After enough nights it reliably hides ~40% of the corpus. The architecture and reframe specify session-scoped novelty, not corpus-lifetime exclusion. Same shape as the NREM bug: a deduplication mechanism running silently in a way the architecture didn't request.
|
||||
2. **`api.py /api/corpus/retry` 50KB truncation.** The F14 fix removed truncation from `watcher.py`, `ingest.py`, `corpus_integrity.py`, but the api.py retry path was missed — clicking "Retry" on an ingest-failure still truncates. Working without errors, doing something the architecture explicitly says not to.
|
||||
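The difference between the cumulative exclusion and a session-scoped one can be made concrete with a small sketch. Everything here is illustrative: the function names, the "keep the most recent `cap` entries" trimming policy, and the corpus size of 1250 used in the usage note are assumptions chosen to reproduce the roughly 40% figure, not values read from `dream.py`.

```python
from typing import Iterable, List, Set

EXCLUSION_CAP = 500  # cap on the persisted retrieved_sources list, per the finding above


def cumulative_exclusion(previous: List[str], tonight: Iterable[str], cap: int = EXCLUSION_CAP) -> List[str]:
    """Cross-night behavior: append tonight's new sources, keep the most recent `cap`."""
    merged = previous + [s for s in tonight if s not in previous]
    return merged[-cap:]


def session_exclusion(tonight: Iterable[str]) -> Set[str]:
    """Session-scoped novelty: only sources retrieved during this run are excluded."""
    return set(tonight)


def hidden_fraction(excluded_count: int, corpus_size: int) -> float:
    """Share of the corpus unreachable because of the exclusion set."""
    return excluded_count / corpus_size if corpus_size else 0.0
```

With a saturated cap and a hypothetical corpus of 1250 sources, `hidden_fraction(500, 1250)` is 0.4, whereas a session-scoped set starts empty every night.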
|
||||
---
|
||||
|
||||
## Phase 2 — Systemd services
|
||||
|
||||
Inventory of every `aaronai*.service` and `aaronai*.timer` in `/etc/systemd/system/`. Status is from `systemctl is-enabled` and `systemctl is-active` taken during this session.
|
||||
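The status columns in this phase can be gathered with a short script along these lines. This is a sketch, not the tooling actually used during the session: the function names are hypothetical, and only the `systemctl is-enabled` and `systemctl is-active` subcommands come from the text. The `query` parameter exists so the logic can be exercised without a running systemd.

```python
import subprocess
from typing import Callable, Dict, Iterable


def systemctl_query(verb: str, unit: str) -> str:
    """Run `systemctl <verb> <unit>` and return its one-word answer."""
    # check=False: both subcommands exit non-zero for ordinary answers
    # such as "inactive" or "disabled", which is not an error here.
    result = subprocess.run(
        ["systemctl", verb, unit], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or "unknown"


def unit_status(
    units: Iterable[str], query: Callable[[str, str], str] = systemctl_query
) -> Dict[str, Dict[str, str]]:
    """Collect enabled/active state for each unit name."""
    return {
        unit: {"enabled": query("is-enabled", unit), "active": query("is-active", unit)}
        for unit in units
    }
```

Recording both answers per unit matters because, as several entries below show, the two can disagree.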
|
||||
### `aaronai.service`
|
||||
- **Status:** Working (enabled, active)
|
||||
- **Unit-file mtime:** 2026-04-24
|
||||
- **Type / trigger:** `simple`, `Restart=always`, `WantedBy=multi-user.target`. Always-running.
|
||||
- **Command:** `/home/aaron/aaronai/venv/bin/python3 /home/aaron/aaronai/scripts/api.py`
|
||||
- **Depends on:** `network.target`
|
||||
- **What depends on it:** `aaronai-graphiti.service`, `aaronai-stage2.service`, `aaronai-stage3.service`, `aaronai-watcher.service` all `After=` it; `Requires=aaronai.service` on Stage 2 and Stage 3.
|
||||
- **Behavior matches intent?** Yes. Hosts the FastAPI backend and the embedded APScheduler. The architecture doc lists this as the long-running api.py process hosting nightly cycles.
|
||||
- **Notes:** No `WatchdogSec`. Restarts on crash. Has been "running since May 01" per the current-state doc.
|
||||
|
||||
### `aaronai-graphiti.service`
|
||||
- **Status:** Working (enabled, active)
|
||||
- **Unit-file mtime:** 2026-04-27
|
||||
- **Type / trigger:** `simple`, `Restart=always`, always-running.
|
||||
- **Command:** `/home/aaron/aaronai/venv/bin/python3 /home/aaron/aaronai/scripts/graphiti_service.py`
|
||||
- **Depends on:** `aaronai.service` (After=, soft); FalkorDB Docker container at `127.0.0.1:6379`; `.env`.
|
||||
- **What depends on it:** `aaronai-stage3.service` (Requires=); `dream.py` when `DREAMER_SUBSTRATE=graphiti`; the Stage 3 worker's `recover_wedge` does `sudo systemctl restart aaronai-graphiti.service`.
|
||||
- **Behavior matches intent?** Yes against architecture doc. Under bespoke decision this is the layer being replaced. Service still runs and the sidecar still answers `/health`.
|
||||
- **Notes:** The 2026-05-01 v2.1 patches (sudoers entry, error logging) are applied in the worker code that calls this; the service unit itself is unchanged.
|
||||
|
||||
### `aaronai-stage2.service`
|
||||
- **Status:** Working (enabled, active)
|
||||
- **Unit-file mtime:** 2026-05-01
|
||||
- **Type / trigger:** `simple`, `Restart=always`, `Requires=aaronai.service`. Always-running worker.
|
||||
- **Command:** `/home/aaron/aaronai/venv/bin/python3 /home/aaron/aaronai/scripts/stage2_worker.py`
|
||||
- **Depends on:** `aaronai.service` (Requires=); Ollama on 11434; `.env`.
|
||||
- **What depends on it:** Stage 3 worker (consumes the queue this fills).
|
||||
- **Behavior matches intent?** Yes for the worker code. Under the bespoke decision the cascade pipeline this feeds is no longer the architectural target — but the unit is doing what its code says.
|
||||
- **Notes:** `WatchdogSec` line is commented out (the 2026-05-01 fix). Logs to `/var/log/aaronai/stage2.log`.
|
||||
|
||||
### `aaronai-stage3.service`
|
||||
- **Status:** Stopped (enabled, **inactive**) — manually stopped per the session brief
|
||||
- **Unit-file mtime:** 2026-05-01
|
||||
- **Type / trigger:** `simple`, `Restart=always`, `Requires=aaronai.service aaronai-graphiti.service`. Would be always-running if started.
|
||||
- **Command:** `/home/aaron/aaronai/venv/bin/python3 /home/aaron/aaronai/scripts/stage3_worker.py`
|
||||
- **Depends on:** `aaronai.service` and `aaronai-graphiti.service` (both Requires=); `.env`; passwordless sudo for `systemctl restart aaronai-graphiti.service`.
|
||||
- **What depends on it:** Nothing technically requires it; corpus integrity reads `stage_3_queue.completed_at` and would see those numbers stop moving while the worker is off.
|
||||
- **Behavior matches intent?** **Divergence.** The unit is `enabled` (i.e., will start at next boot) but currently inactive. The bespoke decision parks this work; on reboot the service will start automatically and resume processing `stage_3_queue` rows. Track 1 cleanup should `systemctl disable` it before next reboot — otherwise the manual stop is a soft guarantee that doesn't survive a power cycle.
|
||||
- **Notes:** `WatchdogSec` line is commented out (the 2026-05-01 fix). Logs to `/var/log/aaronai/stage3.log`. The service file's `Description` still says "Graphiti cascade ingest" — accurate but architecturally stale under bespoke.
|
||||
|
||||
### `aaronai-watcher.service`
|
||||
- **Status:** Working (enabled, active)
|
||||
- **Unit-file mtime:** 2026-04-30
|
||||
- **Type / trigger:** `simple`, `Restart=always`. Always-running.
|
||||
- **Command:** `/home/aaron/aaronai/venv/bin/python3 /home/aaron/aaronai/scripts/watcher.py`
|
||||
- **Environment:** `TRANSFORMERS_OFFLINE=1`, `HF_HUB_OFFLINE=1`, `PATH=/home/aaron/aaronai/venv/bin`. Resource caps: `MemoryMax=3G`, `MemorySwapMax=0`.
|
||||
- **Depends on:** `aaronai.service` (After=); pgvector; SentenceTransformer model files (offline mode means they must already be cached).
|
||||
- **What depends on it:** Anything that reads pgvector or `stage_2_queue` indirectly depends on this filling them.
|
||||
- **Behavior matches intent?** Yes. Stage 1 architectural commitment. The 2026-04-30 in-process refactor matches the architecture doc.
|
||||
- **Notes:** `MemorySwapMax=0` is the post-refactor commitment. Watcher heartbeat at `/home/aaron/aaronai/watcher_heartbeat` is consumed by an external cron monitor (Phase 5 confirms).
|
||||
|
||||
### `aaronai-web.service`
|
||||
- **Status:** Working (enabled, active)
|
||||
- **Unit-file mtime:** 2026-04-26
|
||||
- **Type / trigger:** `simple`, `Restart=always`. Always-running.
|
||||
- **Command:** `/usr/bin/node node_modules/next/dist/bin/next start` from `/home/aaron/aaronai-web` with `NODE_ENV=production` and `PORT=3000`.
|
||||
- **Depends on:** `network.target`.
|
||||
- **What depends on it:** nginx reverse-proxies to port 3000 (per architecture doc); Cloudflare-fronted `ai.aaronnelson.studio`.
|
||||
- **Behavior matches intent?** Yes. Hosts the Next.js frontend per Layer 3 architecture.
|
||||
- **Notes:** Working directory is `~/aaronai-web/` not `~/projects/aaronai-web/` — production deployment is a separate clone of the repo. This is consistent with the architecture doc's "Local: `~/projects/aaronai-web/`, deployed: `~/aaronai-web/`" line.
|
||||
|
||||
### `aaronai-dreamer.service`
|
||||
- **Status:** Working (oneshot; static)
|
||||
- **Unit-file mtime:** 2026-04-26
|
||||
- **Type / trigger:** `Type=oneshot`. Not directly schedulable from systemd (no `[Install]` block — `static`).
|
||||
- **Command:** `/home/aaron/aaronai/venv/bin/python3 /home/aaron/aaronai/scripts/dream.py --mode nrem`
|
||||
- **Depends on:** `network.target`.
|
||||
- **What depends on it:** The session brief noted this service was used for the manual NREM run on 2026-05-02 21:33-21:34 UTC. APScheduler in `api.py` is the production trigger and uses `subprocess.Popen` directly (not this unit) — the unit is only for manual `systemctl start aaronai-dreamer.service` from the shell.
|
||||
- **Behavior matches intent?** Partial. The unit exists and is the only systemd-tracked dream entry point. **It still hardcodes `--mode nrem`** as the command, so a manual `systemctl start aaronai-dreamer.service` runs only NREM, not the full pipeline. The architecture says nightly is full pipeline; the production scheduler in api.py runs `dream.py` with no flag (i.e., default pipeline). The unit's `--mode nrem` is therefore an outdated invocation pattern preserved from when individual stages were run by hand.
|
||||
- **Notes:** Has a paired `aaronai-dreamer.timer` (next entry) that is **not enabled**. APScheduler is the only thing actually triggering nightly dreams.
|
||||
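The disagreement between the unit's command and the scheduler's command can be surfaced mechanically. In the sketch below the unit command is the one listed above; the scheduler command string is an assumption (this inventory only states that `api.py` launches `dream.py` with no flag), and `extra_args` is a hypothetical helper.

```python
import shlex
from typing import List


def extra_args(unit_command: str, scheduler_command: str) -> List[str]:
    """Arguments present in the unit's command but absent from the scheduler's."""
    scheduler_args = shlex.split(scheduler_command)
    return [arg for arg in shlex.split(unit_command) if arg not in scheduler_args]


UNIT_CMD = (
    "/home/aaron/aaronai/venv/bin/python3 "
    "/home/aaron/aaronai/scripts/dream.py --mode nrem"
)
# Assumed shape of the APScheduler invocation: same interpreter and script, no flags.
SCHEDULER_CMD = (
    "/home/aaron/aaronai/venv/bin/python3 "
    "/home/aaron/aaronai/scripts/dream.py"
)
```

Under that assumption, `extra_args(UNIT_CMD, SCHEDULER_CMD)` returns `["--mode", "nrem"]`, which is exactly the stale flag described above.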
|
||||
### `aaronai-dreamer.timer`
|
||||
- **Status:** Stopped — exists but **not in `timers.target.wants/`**, so not enabled
|
||||
- **Unit-file mtime:** 2026-04-27
|
||||
- **Schedule:** `OnCalendar=*-*-* 08:00:00`, `Persistent=true`.
|
||||
- **Triggers:** `aaronai-dreamer.service`
|
||||
- **Behavior matches intent?** Divergence — duplicate scheduling. APScheduler in `api.py` drives the actual 08:00 UTC dream run. This timer would do the same thing (with the wrong invocation — `--mode nrem`) if it were enabled. **NREM-shape divergence: a scheduling mechanism present, configured, and inactive — but its presence will confuse a future reader about who triggers the dream.** Track 1 cleanup candidate: remove or disable explicitly.
|
||||
|
||||
### `aaronai-index-conversations.service`
|
||||
- **Status:** Working (oneshot; static)
|
||||
- **Unit-file mtime:** 2026-04-26
|
||||
- **Type / trigger:** `Type=oneshot`. Static, no Install section.
|
||||
- **Command:** `/home/aaron/aaronai/venv/bin/python3 /home/aaron/aaronai/scripts/ingest_conversations.py`
|
||||
- **Depends on:** `network.target`.
|
||||
- **What depends on it:** Manually triggerable. APScheduler in `api.py` runs `ingest_conversations.py` directly via `subprocess.run` — not this unit.
|
||||
- **Behavior matches intent?** Same shape as the dreamer unit: an alternate entry point that exists for manual debugging. Not on a path that fires.
|
||||
- **Notes:** Logs to `/home/aaron/aaronai/dreamer.log` — same log file as the dreamer service (likely a copy-paste artifact, not a deliberate co-mingling).
|
||||
|
||||
### `aaronai-index-conversations.timer`
|
||||
- **Status:** Stopped — not enabled
|
||||
- **Unit-file mtime:** 2026-04-26
|
||||
- **Schedule:** `OnCalendar=*-*-* 02:30:00`, `Persistent=true`.
|
||||
- **Triggers:** `aaronai-index-conversations.service`
|
||||
- **Behavior matches intent?** Same divergence pattern as `aaronai-dreamer.timer`. APScheduler in `api.py` is the real driver at 02:30 UTC. This timer is dormant and would silently double-fire the job if enabled.
|
||||
|
||||
### `aaronai-maintenance.service`
|
||||
- **Status:** Broken (oneshot; static; **command is unrunnable**)
|
||||
- **Unit-file mtime:** 2026-04-26
|
||||
- **Type / trigger:** `Type=oneshot`. Static.
|
||||
- **Command:** `/home/aaron/aaronai/venv/bin/chops hnsw rebuild --path /home/aaron/aaronai/db --collection aaronai`
|
||||
- **Depends on:** `chops` binary in venv, ChromaDB at `/home/aaron/aaronai/db/`.
|
||||
- **What depends on it:** Nothing. `aaronai-maintenance.timer` would trigger it weekly if enabled, but the timer is not enabled.
|
||||
- **Behavior matches intent?** **No.** This unit is from the ChromaDB era. The architecture doc records the ChromaDB → pgvector migration on 2026-04-26. Verified during this inventory: `chops` is **not present** in `~/aaronai/venv/bin/`, and `~/aaronai/db/` still contains `chroma.sqlite3` and a UUID-named subdirectory but is no longer the active corpus store. **If anyone ever ran `systemctl start aaronai-maintenance.service`, it would fail because the `ExecStart` binary does not exist (systemd reports this as `status=203/EXEC`).**
|
||||
- **Notes:** Track 1 removal candidate. Both this and its timer are pure dead state; the `~/aaronai/db/` directory is a separate cleanup decision (it holds historical ChromaDB data, possibly recoverable).
|
||||
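A check like the one done by hand for `chops` can be sketched as a small function that pulls the binary out of a unit file's `ExecStart=` line and tests whether it exists and is executable. The helper names are hypothetical, and the parsing is deliberately simple: it handles a single `ExecStart=` line and strips systemd's special prefix characters, nothing more.

```python
import os
import shlex
from typing import Optional


def exec_start_binary(unit_text: str) -> Optional[str]:
    """Return the executable path from the first ExecStart= line, if any."""
    for line in unit_text.splitlines():
        line = line.strip()
        if line.startswith("ExecStart="):
            # Drop systemd's optional prefixes such as '-', '@', '+', '!' or ':'.
            command = line[len("ExecStart="):].lstrip("-@+!:")
            parts = shlex.split(command)
            return parts[0] if parts else None
    return None


def is_runnable(unit_text: str) -> bool:
    """True only if the ExecStart binary exists and is executable."""
    binary = exec_start_binary(unit_text)
    if binary is None:
        return False
    return os.path.isfile(binary) and os.access(binary, os.X_OK)
```

Running this over every unit in the inventory would flag dead units of this kind without anyone having to start them.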
|
||||
### `aaronai-maintenance.timer`
|
||||
- **Status:** Stopped — not enabled
|
||||
- **Unit-file mtime:** 2026-04-26
|
||||
- **Schedule:** `OnCalendar=Sun *-*-* 04:00:00`, `Persistent=true`.
|
||||
- **Triggers:** `aaronai-maintenance.service` (broken).
|
||||
- **Behavior matches intent?** No — points at a broken service.
|
||||
- **Notes:** Track 1 removal candidate.
|
||||
|
||||
---
|
||||
|
||||
### Phase 2 summary
|
||||
|
||||
**Working and matching intent:**
|
||||
- `aaronai.service`
|
||||
- `aaronai-graphiti.service` (matches the existing-architecture intent; bespoke decision will replace the layer it serves)
|
||||
- `aaronai-stage2.service` (same caveat)
|
||||
- `aaronai-watcher.service`
|
||||
- `aaronai-web.service`
|
||||
|
||||
**Working with behavior-vs-intent divergences:**
|
||||
- `aaronai-dreamer.service` — hardcodes `--mode nrem`; production trigger is APScheduler running default pipeline. The systemd entry-point and the production entry-point disagree about what "dream" means.
|
||||
|
||||
**Stopped / broken:**
|
||||
- `aaronai-stage3.service` — manually stopped 2026-05-02; **still `enabled` so will autostart on next reboot**.
|
||||
- `aaronai-dreamer.timer`, `aaronai-index-conversations.timer` — not enabled; redundant with APScheduler.
|
||||
- `aaronai-maintenance.service` and `aaronai-maintenance.timer` — broken (`chops` not installed); ChromaDB-era leftover.
|
||||
- `aaronai-index-conversations.service` — static, harmless oneshot wrapper.
|
||||
|
||||
**Removal candidates (do not remove):**
|
||||
- `aaronai-maintenance.service` and `.timer`
|
||||
- `aaronai-dreamer.timer`, `aaronai-index-conversations.timer` (or, alternatively, disable APScheduler and use the timers — the duplication is the problem, not the choice)
|
||||
- `aaronai-stage3.service` should be `disabled` even if not removed, so the manual-stop survives a reboot.
|
||||
|
||||
**NREM-shaped divergences in Phase 2:**
|
||||
1. **`aaronai-stage3.service` is `enabled` but `inactive`.** Manual stop does not survive reboot; on next reboot the worker resumes against `stage_3_queue`, which is being filled by Stage 2. Same shape as the NREM bug: the operationally-stopped state is paper-thin. The architecture's stated "service stopped" intent is undermined by a `systemctl is-enabled` value nobody changed.
|
||||
2. **`aaronai-maintenance.service` against ChromaDB.** Service is configured, would attempt to run if its (disabled) timer fired, would fail. The architectural intent (ChromaDB retired) and the systemd state (unit still installed and enabled-static) are out of sync. The disabled timer is the only thing protecting against running this.
|
||||
3. **Double-scheduled triggers.** APScheduler in `api.py` plus the dreamer and index-conversations timer files means two competing schedulers are configured for the same nightly work. Only APScheduler is firing; the timers are dormant. This is exactly the mechanism-still-present-but-not-architecturally-intended pattern.
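The first divergence above comes down to two independent systemd properties: whether a unit is running now, and whether it will start at boot. The following is a minimal sketch of a check that flags the mismatch, assuming the two values come from `systemctl is-active` and `systemctl is-enabled`. The function names and the returned labels are hypothetical and are not part of the aaronai codebase.

```python
import subprocess


def classify_unit(active_state: str, enabled_state: str) -> str:
    """Classify a unit from its `is-active` and `is-enabled` outputs.

    The labels returned here are illustrative, not systemd terminology.
    """
    running = active_state == "active"
    starts_at_boot = enabled_state == "enabled"
    if not running and starts_at_boot:
        # Manually stopped but still enabled: the stop will not survive a reboot.
        return "stop-not-durable"
    if running and not starts_at_boot:
        # Running now, but nothing will bring it back after a reboot.
        return "start-not-durable"
    return "consistent"


def read_unit_state(unit: str) -> tuple[str, str]:
    """Query systemd for both states (only meaningful on a systemd host)."""
    def query(verb: str) -> str:
        # systemctl exits non-zero for inactive or disabled units,
        # so the return code is deliberately not checked.
        result = subprocess.run(
            ["systemctl", verb, unit], capture_output=True, text=True
        )
        return result.stdout.strip()

    return query("is-active"), query("is-enabled")
```

Under these assumptions, `classify_unit(*read_unit_state("aaronai-stage3.service"))` would report the stage 3 worker as `stop-not-durable` for as long as it is stopped but still enabled.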
|
||||
|
||||
---
|
||||
|
||||
|
||||
|
||||
---
|
||||
|
||||
## Phase 3 — Database tables
|
||||
|
||||
PostgreSQL `aaronai` database, `public` schema. Five tables. Connected via `PG_DSN` from `.env` (value not echoed in this document). All queries were `SELECT`-only or `\d`-style. Counts were taken during this session.
|
||||
|
||||
### `embeddings`
|
||||
- **Status:** Working (the production retrieval substrate)
|
||||
- **Columns:**
|
||||
- `id text NOT NULL` (PK)
|
||||
- `document text NOT NULL` (chunk content)
|
||||
- `embedding USER-DEFINED` (pgvector `vector(384)`)
|
||||
- `source text` (filename/conversation title)
|
||||
- `type text` (document / chatgpt_conversation / claude_conversation / aaronai_conversation / claude_memory / NULL)
|
||||
- `created_at text` (string-typed, not timestamptz; many rows NULL)
|
||||
- `metadata jsonb`
|
||||
- **Indexes:**
|
||||
- `embeddings_pkey` btree on `id`
|
||||
- `embeddings_vector_idx` HNSW (m=16, ef_construction=64, vector_cosine_ops)
|
||||
- `embeddings_source_idx` btree on `source`
|
||||
- **Row count:** 13,874
|
||||
- **Distinct sources:** 1,236
|
||||
- **Type distribution:** `document` 1,368 | `chatgpt_conversation` 1,548 | `claude_conversation` 1,074 | `aaronai_conversation` 68 | `claude_memory` 1 | NULL 9,815
|
||||
- **Writes:** `watcher.py:ingest_file()`, `ingest.py:ingest_file()`, `ingest_conversations.py:run()`. `corpus_integrity.py:queue_for_retry()` writes to `stage_2_queue` rather than directly to this table; on a normal ingest path the resulting chunks land here.
|
||||
- **Reads:** `api.py:retrieve_context()`, `dream.py:retrieve()` (pgvector branch), `corpus_integrity.py`, `tier1_migration.py:fetch_tier1_sources()`, several experiment scripts
|
||||
- **Behavior matches intent?** Partial. **9,815 of 13,874 rows have `type IS NULL` (~71%)** — this is unexpected given the architecture doc's commitment to typing every chunk. Looking at the code, `watcher.py:ingest_file()` writes `type='document'` and `ingest_conversations.py` writes `'aaronai_conversation'`. The 9,815 NULLs are likely artifacts of older ingest runs or `ingest_chatgpt.py`/`ingest_claude.py` (referenced in the architecture doc but not present in `scripts/` — possibly run as one-shots from an earlier point and deleted). **Additionally, `created_at` is stored as `text` rather than `timestamptz`**, and 12,109 rows have it NULL. Both are NREM-shape divergences: data fields the architecture treats as load-bearing for "temporal awareness" exist in the schema but are mostly empty or mistyped.
|
||||
- **Notes:** HNSW index parameters match the doc. The vector dimension is 384 (matches `all-MiniLM-L6-v2`).
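The percentages quoted for `embeddings` can be reproduced from the counts in this entry. This is a small arithmetic sketch using only the figures reported above; the variable and function names are illustrative.

```python
# Type distribution as reported in the `embeddings` entry above.
type_counts = {
    "document": 1368,
    "chatgpt_conversation": 1548,
    "claude_conversation": 1074,
    "aaronai_conversation": 68,
    "claude_memory": 1,
    None: 9815,  # rows where `type IS NULL`
}
total_rows = 13874
null_created_at = 12109  # rows where `created_at IS NULL`


def percent(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100 * part / whole


# The per-type counts should account for every row in the table.
assert sum(type_counts.values()) == total_rows

null_type_share = percent(type_counts[None], total_rows)
null_created_at_share = percent(null_created_at, total_rows)
print(f"type IS NULL: {null_type_share:.1f}%")
print(f"created_at IS NULL: {null_created_at_share:.1f}%")
```

Both shares round to the 71% and 87% figures used in this section.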
|
||||
|
||||
### `stage_2_queue`
|
||||
- **Status:** Working (active queue feeding stage2_worker)
|
||||
- **Columns:**
|
||||
- `id integer NOT NULL` (PK, sequence)
|
||||
- `source text NOT NULL UNIQUE`
|
||||
- `full_text text NOT NULL` (no longer truncated post-F14)
|
||||
- `char_length integer NOT NULL`
|
||||
- `enqueued_at timestamptz NOT NULL default NOW()`
|
||||
- `started_at`, `completed_at`, `failed_at` timestamptz nullable
|
||||
- `failure_reason text`
|
||||
- `attempts integer NOT NULL default 0`
|
||||
- **Indexes:** PK + unique on `source`.
|
||||
- **Row count:** 48 (25 completed, 21 failed, 2 pending)
|
||||
- **Failure breakdown:**
|
||||
- `park_pending_phase_2_reframe` — 19 rows (manually-marked, the parked meta-documents per the reframe)
|
||||
- `mistral_timeout_after_300s` — 2 rows
|
||||
- **Last enqueued:** 2026-05-02 22:22 UTC
|
||||
- **Last completed:** 2026-05-02 22:33 UTC
|
||||
- **Writes:** `watcher.py:enqueue_stage2()`, `ingest.py:enqueue_stage2()`, `corpus_integrity.py:queue_for_retry()`, `api.py:/api/corpus/retry`, `stage2_worker.py` (updates state)
|
||||
- **Reads:** `stage2_worker.py:run()`
|
||||
- **Behavior matches intent?** Yes. The queue is doing what it was redesigned to do post-F14. The 19 manually-parked rows match the reframe doc's mention of parked meta-documents.
|
||||
- **Notes:** **The watcher is still actively enqueuing rows at 2026-05-02 22:22 — meaning Stage 2 is still consuming the queue and feeding Stage 3.** This is fine architecturally for now, but worth flagging given Stage 3 is stopped (Phase 2). See Phase 3 summary divergence #1.
|
||||
|
||||
### `stage_3_queue`
|
||||
- **Status:** Working-degraded
|
||||
- **Columns (base):**
|
||||
- `id integer NOT NULL` (PK, sequence)
|
||||
- `source text NOT NULL UNIQUE`
|
||||
- `full_text text NOT NULL`
|
||||
- `orientation text NOT NULL`
|
||||
- `stage2_metadata jsonb`
|
||||
- `enqueued_at timestamptz NOT NULL default NOW()`
|
||||
- `started_at`, `completed_at`, `failed_at` timestamptz nullable
|
||||
- `failure_reason text`
|
||||
- `attempts integer NOT NULL default 0`
|
||||
- **Columns (rolled-back-migration leftovers, all unused by current code):**
|
||||
- `state_type text` (added by `30beeb3`, unused)
|
||||
- `state_type_confidence text` (unused)
|
||||
- `supersedes_prior_state boolean` (unused)
|
||||
- `state_type_rationale text` (unused)
|
||||
- `external_job_id uuid` (added by `a0bf280`, unused)
|
||||
- **Indexes:**
|
||||
- `stage_3_queue_pkey`
|
||||
- `stage_3_queue_source_key` (unique on source)
|
||||
- `stage_3_queue_supersedes_idx` btree on `supersedes_prior_state` — unused
|
||||
- `idx_stage_3_queue_external_job` partial btree on `external_job_id` where not-null and not-completed/failed — unused
|
||||
- **Row count:** 19 (11 completed, 3 failed, 6 pending). 1 row has `state_type` populated (the smoke-test); 0 have `external_job_id`.
|
||||
- **Failure breakdown:**
|
||||
- 2 × `HTTPConnectionPool(host='localhost', port=8001): Read timed out. (read timeout=600)` (the May-1 incident period)
|
||||
- 1 × `Bulk path against new content unpatched; deferred until search_utils.py sites 4-9 are patched` (rolled-back work artifact)
|
||||
- **Last enqueued:** 2026-05-02 22:33 UTC (Stage 2 just enqueued a row).
|
||||
- **Writes:** `stage2_worker.py:enqueue_stage3()`, `stage3_worker.py` (state updates).
|
||||
- **Reads:** `stage3_worker.py:run()`, `corpus_integrity.py:get_graphiti_sources()`, `api.py:get_corpus_status_data()`.
|
||||
- **Behavior matches intent?** **Partial / multiple divergences.**
|
||||
- 5 columns and 2 indexes from rolled-back migrations remain. Inert under current code, but they are visible to anyone reading the schema and will mislead. The current-state doc said `idx_stage_3_queue_supersedes` "may also still exist" — confirmed: it does, **plus** `idx_stage_3_queue_external_job` which the current-state doc didn't mention.
|
||||
- The queue is filling without a consumer. Stage 3 worker is stopped (Phase 2); Stage 2 worker is enqueuing. As of 22:33 UTC there are 6 pending rows.
|
||||
- **Notes:** Cleanup SQL is in the current-state doc. Track 1 candidate for removal (low priority — no harm in leaving).
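The queue row counts in this phase each come with a per-status breakdown, which allows a quick consistency check. The sketch below uses only the figures reported in this section; `count_gap` is a hypothetical helper name.

```python
def count_gap(total: int, by_status: dict[str, int]) -> int:
    """Return the reported total minus the sum of the per-status counts.

    Zero means the breakdown accounts for every row.
    """
    return total - sum(by_status.values())


# Figures as reported for each queue.
stage_2_gap = count_gap(48, {"completed": 25, "failed": 21, "pending": 2})
stage_3_gap = count_gap(19, {"completed": 11, "failed": 3, "pending": 6})

print("stage_2_queue gap:", stage_2_gap)
print("stage_3_queue gap:", stage_3_gap)
```

The `stage_2_queue` breakdown accounts for every row. The `stage_3_queue` breakdown (11 + 3 + 6) sums to 20 against a reported total of 19, so one of those figures is off by one; the source does not indicate which.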
|
||||
|
||||
### `graphiti_jobs`
|
||||
- **Status:** Working-degraded (rolled-back-code artifact)
|
||||
- **Columns:**
|
||||
- `job_id uuid NOT NULL` (PK)
|
||||
- `job_type text NOT NULL`
|
||||
- `payload jsonb NOT NULL`
|
||||
- `status text NOT NULL default 'queued'`
|
||||
- `enqueued_at timestamptz NOT NULL default NOW()`
|
||||
- `started_at`, `finished_at` timestamptz nullable
|
||||
- `error text`
|
||||
- `summary jsonb`
|
||||
- `submitted_by text`
|
||||
- **Indexes:**
|
||||
- `graphiti_jobs_pkey`
|
||||
- `idx_graphiti_jobs_queued` partial btree on `enqueued_at` where status='queued'
|
||||
- `idx_graphiti_jobs_status` btree on `status`
|
||||
- **Row count:** **9 (NOT empty)** — 6 failed, 3 committed.
|
||||
- **Activity window:** All 9 jobs fall between 2026-05-02 02:26 UTC and 2026-05-02 05:50 UTC, during the experimental run that preceded the rollback. Mix of `single` and `bulk` job types.
|
||||
- **Writes:** None in current code. The Pattern 1 async-job consumer/producer was rolled back.
|
||||
- **Reads:** None in current code.
|
||||
- **Behavior matches intent?** **No.** The current-state doc said this table "exists, empty (or near-empty)". It is not empty — 9 jobs from the May-2 experimental run remain. They are inert (nothing reads or writes the table now), but the documented state and the actual state disagree. Drop the table per the current-state doc's cleanup SQL.
|
||||
- **Notes:** Two of the 6 failures have `started_at IS NULL` and a non-null `finished_at` — those are jobs that were marked failed without ever being claimed by a worker. Pattern in the rolled-back code. Of historical interest only.
|
||||
|
||||
### `ingest_failures`
|
||||
- **Status:** Working
|
||||
- **Columns:**
|
||||
- `id integer NOT NULL` (PK, sequence)
|
||||
- `source text NOT NULL UNIQUE`
|
||||
- `filepath text NOT NULL`
|
||||
- `error text NOT NULL`
|
||||
- `retry_count integer NOT NULL default 0`
|
||||
- `first_failed_at`, `last_failed_at` timestamptz default NOW()
|
||||
- `resolved boolean NOT NULL default false`
|
||||
- `category text NOT NULL default 'transient'`
|
||||
- **Indexes:** PK + unique on `source`.
|
||||
- **Row count:** 129 (all `category='unreadable'`, all `resolved=false`)
|
||||
- **Writes:** `watcher.py:record_ingest_failure()`, `corpus_integrity.py` (auto-queue path), `api.py:/api/corpus/retry`
|
||||
- **Reads:** `api.py:get_corpus_status_data()`, `corpus_integrity.py:get_ingest_failures()`
|
||||
- **Behavior matches intent?** Yes. Matches the architecture's "ingest_failures table for UI visibility" tech-debt-resolved entry. The 129 unreadable files match the 129 figure cited in the architecture doc — these are scanned/encrypted/corrupt PDFs awaiting OCR (priority 21b).
|
||||
- **Notes:** The `category` field has only one observed value (`'unreadable'`); `'transient'` is the default but no rows currently carry it. Consistent with the architecture: only persistent failures (after watcher retry) make it here.
|
||||
|
||||
---
|
||||
|
||||
### Phase 3 summary
|
||||
|
||||
**Working and matching intent:**
|
||||
- `ingest_failures` (129 unreadable, awaiting OCR, all matches doc)
|
||||
- `stage_2_queue` (functioning queue, post-F14)
|
||||
|
||||
**Working with behavior-vs-intent divergences:**
|
||||
- `embeddings` — 71% of rows have `type IS NULL`; 87% have `created_at IS NULL`; `created_at` is `text`-typed not timestamptz. The temporal-awareness commitment in the architecture is largely unsupported by the data actually in the table.
|
||||
- `stage_3_queue` — five rolled-back-migration columns and two unused indexes remain; queue is being filled by Stage 2 with no consumer running.
|
||||
|
||||
**Broken / rolled-back:**
|
||||
- `graphiti_jobs` — 9 rows from the rolled-back experimental work; current-state doc says "empty"; reality says otherwise. No current code touches it.
|
||||
|
||||
**Removal candidates (do not remove):**
|
||||
- `stage_3_queue` columns: `state_type`, `state_type_confidence`, `supersedes_prior_state`, `state_type_rationale`, `external_job_id` and the two related indexes.
|
||||
- `graphiti_jobs` table entirely.
|
||||
- `embeddings.created_at` — under bespoke, the new substrate's temporal model replaces this; the column probably gets dropped in the bespoke build.
|
||||
|
||||
**NREM-shaped divergences in Phase 3:**
|
||||
1. **Stage 2 still enqueues to Stage 3 while Stage 3 is stopped.** Pending count grows over time. There is no architectural-level decision to do this; it's a consequence of leaving Stage 2 running while turning off its consumer. The pending rows are inert until a consumer attaches, but the design says one queue stage feeds the next — and the consumer is gone. Same shape: a pipeline working "without errors" and producing state nobody is consuming.
|
||||
2. **`embeddings.type` is NULL for 71% of rows.** The architecture treats `type` as a load-bearing field for distinguishing document vs conversation chunks at retrieval time. In production, more than two-thirds of chunks lack the field. Retrieval still works because nothing routes on `type`. The mechanism is in place, doing nothing visible, and the absence is invisible to anyone not querying the schema directly.
|
||||
3. **`embeddings.created_at` is `text`-typed and 87% NULL.** Same shape: the doc treats temporal awareness as architectural; the data shape doesn't support time-based queries even where the column exists.
|
||||
4. **`graphiti_jobs` documented as empty, actually has 9 rows.** Current-state doc explicitly anticipates the wrong state. Verifying the doc against the database surfaced this.
|
||||
|
||||
---
|
||||
|
||||
|
||||
|
||||
---
|
||||
|
||||
## Phase 4 — Configuration
|
||||
|
||||
### `~/aaronai/.env`
|
||||
|
||||
Eight keys present. **Values redacted in this document; only key name, length, and shape are reported.**
|
||||
|
||||
| Key | Length | Shape | Used by | Still referenced? |
|
||||
|---|---|---|---|---|
|
||||
| `ANTHROPIC_API_KEY` | 108 | opaque | `api.py` (Anthropic client), `dream.py:_call_claude`, `graphiti_service.py` (as fallback when `LLM_API_KEY` unset), several experiment scripts | Yes |
|
||||
| `AARON_AI_PASSWORD` | 16 | opaque | `api.py:/auth/login` | Yes |
|
||||
| `NEXTCLOUD_URL` | 36 | uri | `api.py` capture endpoint, `dream.py:deliver` | Yes |
|
||||
| `NEXTCLOUD_USER` | 5 | opaque | Same as above | Yes |
|
||||
| `NEXTCLOUD_PASSWORD` | 29 | opaque | Same — WebDAV app password | Yes |
|
||||
| `PG_DSN` | 71 | opaque (postgres connection string) | Every Postgres-touching script (`api.py`, `dream.py`, `watcher.py`, `ingest.py`, `ingest_conversations.py`, both workers, `corpus_integrity.py`, `tier1_migration.py`, all experiment scripts) | Yes |
|
||||
| `LLM_PROVIDER` | 9 | opaque (matches `"anthropic"`) | `graphiti_service.py:get_llm_client` | Yes (graphiti only) |
|
||||
| `LLM_MODEL` | 25 | opaque (matches `"claude-sonnet-4-6"` length) | `graphiti_service.py` | Yes (graphiti only) |
|
||||
|
||||
**Variables documented in the architecture doc but NOT present in `.env`:**
|
||||
- `LLM_API_KEY`: listed in the architecture doc's table. `graphiti_service.py` reads `LLM_API_KEY` first and falls back to `ANTHROPIC_API_KEY`, so current behavior depends on the fallback. Architecturally fine, but the "user brings their own key" LLM-agnostic framing (architecture doc Section 5) is achieved by a fallback rather than an explicit key. Track 1 candidate: either set `LLM_API_KEY` explicitly or correct the doc, since the fallback is the path actually in use and `LLM_API_KEY` is the unused one.
|
||||
- `FALKORDB_HOST`, `FALKORDB_PORT`, `GRAPHITI_GROUP_ID` — referenced in `graphiti_service.py` with defaults (`localhost`, `6379`, `aaron`). Defaults are correct for current deployment; absence from `.env` is fine. Worth flagging only because the architecture doc lists `group_id="aaron"` as a single-tenant assumption (F26).
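The lookup order described in the two bullets above can be sketched as follows. This is not the code in `graphiti_service.py`; the function names are hypothetical, and only the variable names, the fallback order, and the three defaults (`localhost`, `6379`, `aaron`) come from this inventory.

```python
import os
from typing import Mapping, Optional


def resolve_llm_api_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Prefer LLM_API_KEY; fall back to ANTHROPIC_API_KEY when it is unset or empty."""
    return env.get("LLM_API_KEY") or env.get("ANTHROPIC_API_KEY")


def resolve_graph_config(env: Mapping[str, str] = os.environ) -> dict:
    """Read the FalkorDB/Graphiti settings, using the defaults noted above."""
    return {
        "host": env.get("FALKORDB_HOST", "localhost"),
        "port": int(env.get("FALKORDB_PORT", "6379")),
        "group_id": env.get("GRAPHITI_GROUP_ID", "aaron"),
    }
```

With the current `.env`, which sets `ANTHROPIC_API_KEY` but not `LLM_API_KEY`, a lookup of this shape always takes the fallback branch.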
|
||||
|
||||
**Variables loaded but worth flagging:**
|
||||
- All Postgres-touching scripts call `load_dotenv(Path.home() / "aaronai" / ".env", override=True)` (or without `override`). Different scripts use different override behavior; this is harmless but inconsistent.
|
||||
|
||||
**Behavior matches intent?** Partial. The `.env` file works; the documented LLM-agnostic story is a fallback story, not an enforced one. File mode is `600` per the architecture commitment (confirmed in an earlier pass).
|
||||
|
||||
### `~/aaronai/settings.json`
|
||||
|
||||
Active contents:
|
||||
|
||||
```json
|
||||
{
|
||||
"theme": "light",
|
||||
"font_size": "medium",
|
||||
"web_search": true,
|
||||
"show_sources": true
|
||||
}
|
||||
```
|
||||
|
||||
`api.py:DEFAULT_SETTINGS` (line 46) defines a wider key set:
|
||||
|
||||
```python
|
||||
{
|
||||
"theme": "light",
|
||||
"font_size": "medium",
|
||||
"web_search": True,
|
||||
"show_sources": True,
|
||||
"dream_hour_utc": 8,
|
||||
"dream_minute_utc": 0,
|
||||
"dream_mode": "nrem",
|
||||
"ingest_hour_utc": 2,
|
||||
"ingest_minute_utc": 30,
|
||||
"share_time": True,
|
||||
}
|
||||
```
|
||||
|
||||
`load_settings()` merges file over defaults; `save_settings()` writes whatever it is given. The file currently holds only the four UI-tunable keys. The other six are loaded from defaults.
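The "file over defaults" merge can be illustrated with a short sketch. The key set is copied from the `DEFAULT_SETTINGS` listing above. The function bodies are an assumption about how such a merge is typically written, not a copy of `api.py`; in particular, the missing-file handling is hypothetical.

```python
import json
from pathlib import Path

# Key set copied from the `DEFAULT_SETTINGS` listing above.
DEFAULT_SETTINGS = {
    "theme": "light",
    "font_size": "medium",
    "web_search": True,
    "show_sources": True,
    "dream_hour_utc": 8,
    "dream_minute_utc": 0,
    "dream_mode": "nrem",
    "ingest_hour_utc": 2,
    "ingest_minute_utc": 30,
    "share_time": True,
}


def merge_settings(on_disk: dict) -> dict:
    """Overlay file values on the defaults; file values win on conflict."""
    return {**DEFAULT_SETTINGS, **on_disk}


def load_settings(path: Path) -> dict:
    """Read settings.json if present, then merge (missing-file handling is assumed)."""
    try:
        on_disk = json.loads(path.read_text())
    except FileNotFoundError:
        on_disk = {}
    return merge_settings(on_disk)


# The four keys currently held in settings.json.
file_settings = {
    "theme": "light",
    "font_size": "medium",
    "web_search": True,
    "show_sources": True,
}
merged = merge_settings(file_settings)
from_defaults = sorted(set(merged) - set(file_settings))
print(from_defaults)
```

With the current four-key file, the six schedule and sharing keys come entirely from the defaults, which is why editing `settings.json` by hand shows no trace of them.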
|
||||
|
||||
**What is referenced by current code:**
|
||||
- `theme`, `font_size` — frontend only (Phase 6)
|
||||
- `web_search` — `api.py:chat()` (line 307) — toggles the web_search tool block
|
||||
- `show_sources` — `api.py:/api/chat` (line 521) — gates whether sources are returned in the chat response
|
||||
- `dream_hour_utc`, `dream_minute_utc` — `api.py:reschedule_jobs()` (line 1149)
|
||||
- `ingest_hour_utc`, `ingest_minute_utc` — `api.py:reschedule_jobs()` (line 1159)
|
||||
- `dream_mode` — present in defaults; **not read anywhere in `api.py` or `dream.py`**. Searching the codebase: `dream_mode` appears only in `DEFAULT_SETTINGS` and the `schedule_keys` set in `update_settings`; `run_dream_job` always invokes `dream.py` with no flag (full pipeline). The setting is dead from the scheduler's perspective — it may be read by the frontend SettingsPanel for the default value of the on-demand "Dream Now" mode dropdown (Phase 6).
|
||||
- `share_time` — **frontend-controlled UI flag, backend stores-and-returns.** The backend persists it via `/api/settings` but does not act on its value. Frontend reads it at `MessageInput.tsx:58` and `SettingsPanel.tsx:205` (both with `?? true` fallback) and writes it back through the SettingsPanel toggle. The flag gates whether `client_time` is included in the `/api/chat` request payload (`lib/api.ts:51-57`); when off, the request omits the key and the backend's unconditional prompt-side insertion at `chat()` line 293 has nothing to insert. *Verified by cross-repo grep 2026-05-02 — the original "frontend-only or dead" / "removal candidate" framing was wrong; this is a working persistence pattern, structurally distinct from `dream_mode`.*
|
||||
|
||||
**Behavior matches intent?** Partial — but the two suspect keys behave very differently and should not be lumped together. **`dream_mode` is a NREM-shape divergence:** it reads as a configurable scheduling parameter (declared in `DEFAULT_SETTINGS`, listed in `schedule_keys` for the reschedule trigger), but `run_dream_job` ignores it. A future maintainer flipping the value expects different nightly behavior and gets none. **`share_time`, in contrast, is a backend-stores-and-returns persistence pattern** — the backend correctly persists a frontend-owned flag and the frontend acts on it (with a `?? true` fallback if the key is missing). The distinction matters: removing a silently-ignored key removes dead code, while removing a stores-and-returns key changes the seed default for new users. *Verification finding 2026-05-02 (cross-repo grep against `~/aaronai-web`).*
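If `dream_mode` were to be honored rather than removed, one possible shape is to translate it into the dreamer's `--mode` flag when building the subprocess command. This is a hedged sketch only: `build_dream_command` is a hypothetical helper, and the `"full"` sentinel meaning "no flag, run the whole pipeline" is an assumption introduced here. The only parts taken from this inventory are that `dream.py` accepts `--mode nrem` and runs the full pipeline when called with no arguments.

```python
import sys


def build_dream_command(settings: dict, script: str = "dream.py") -> list[str]:
    """Build the argv for a dream run from the merged settings.

    A missing `dream_mode`, or the hypothetical value "full", yields the
    bare command, which the dreamer treats as the full pipeline.
    """
    command = [sys.executable, script]
    mode = settings.get("dream_mode")
    if mode and mode != "full":
        command += ["--mode", mode]
    return command
```

Because the current default is `"dream_mode": "nrem"`, wiring the setting through unchanged would switch the nightly run from the full pipeline to NREM only. Any such change would need the default revisited at the same time.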
|
||||
|
||||
---
|
||||
|
||||
### Phase 4 summary
|
||||
|
||||
**Working and matching intent:**
|
||||
- All eight `.env` keys are referenced by code.
|
||||
- The four-key `settings.json` reflects the UI-tunable preferences.
|
||||
|
||||
**Working with behavior-vs-intent divergences:**
|
||||
- `LLM_API_KEY` documented but not set; relies on `ANTHROPIC_API_KEY` fallback.
|
||||
- `dream_mode` exists in defaults but isn't read by the scheduler.
|
||||
|
||||
**Removal candidates (do not remove):**
|
||||
- `dream_mode` — clarify in code or remove from defaults. *(`share_time` was previously listed here in error; cross-repo grep 2026-05-02 confirmed it is a working frontend-controlled flag, not a removal candidate.)*
|
||||
|
||||
**NREM-shaped divergences in Phase 4:**
|
||||
1. **`dream_mode` setting silently ignored.** A scheduler-shaped knob that exists, has a default, is mergeable from settings.json, and is not used. Future maintainer flipping it expects different nightly behavior; gets none.
|
||||
|
||||
---
|
||||
|
||||
|
||||
|
||||
---
|
||||
|
||||
## Phase 5 — Cron and scheduled work
|
||||
|
||||
### User crontab (`crontab -l`)
|
||||
|
||||
Two active entries:
|
||||
|
||||
| Schedule | Command | What it does |
|
||||
|---|---|---|
|
||||
| `0 3 * * *` (daily 03:00 UTC) | `/bin/bash /home/aaron/aaronai/scripts/backup.sh` | Snapshots `memory.md`, `settings.json`, `conversations.db` into `Nextcloud/Admin/Backups/`. 7-day retention. |
|
||||
| `*/5 * * * *` (every 5 min) | `test $(( $(date +%s) - $(cat /home/aaron/aaronai/watcher_heartbeat 2>/dev/null || echo 0) )) -gt 600 && sudo systemctl restart aaronai-watcher >> /var/log/aaronai/watcher-cron.log 2>&1` | Heartbeat watchdog. Restarts the watcher service if the heartbeat file is older than 600 seconds. |
|
||||
|
||||
**Behavior matches intent?** Yes. The watcher heartbeat watchdog corresponds to the architecture-doc tech-debt entry "Heartbeat file written every 5s … cron job restarts watcher if heartbeat older than 10 minutes." The 600s threshold matches the doc's "10 minutes" figure. `backup.sh` is on the documented daily schedule.
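The watchdog's shell test can be restated in Python to make the threshold logic explicit. This is a sketch, assuming the heartbeat file holds an epoch timestamp in seconds (which the cron arithmetic implies); the function names are hypothetical.

```python
import time
from pathlib import Path
from typing import Optional

MAX_AGE_SECONDS = 600  # threshold used by the cron entry


def read_heartbeat(path: Path) -> float:
    """Return the epoch timestamp stored in the heartbeat file, or 0.0 if unreadable."""
    try:
        return float(path.read_text().strip())
    except (OSError, ValueError):
        # The cron line falls back to 0 when `cat` fails; treating an empty or
        # malformed file the same way is an extra assumption made here.
        return 0.0


def age_exceeds(last_beat: float, now: float, max_age: float = MAX_AGE_SECONDS) -> bool:
    """Mirror the shell test `now - last_beat -gt 600` (strictly greater than)."""
    return now - last_beat > max_age


def watcher_needs_restart(path: Path, now: Optional[float] = None) -> bool:
    """Decide whether the watchdog would trigger a restart at time `now`."""
    current = time.time() if now is None else now
    return age_exceeds(read_heartbeat(path), current)
```

A missing heartbeat file reads as timestamp 0, so the age is effectively the current epoch time and the restart branch always fires. The cron line has the same property.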
|
||||
|
||||
**Notes:** The watcher-restart entry uses passwordless `sudo` for `systemctl restart aaronai-watcher`. This is **not** in `/etc/sudoers.d/aaron-aaronai` (which the session brief lists as containing `restart ollama` and `restart aaronai-graphiti.service`). Either it's in `/etc/sudoers` proper (the original `aaronai-web` line area), or the cron entry is silently failing on every fire. Worth verifying — the cron line redirects stderr to the log, so a `sudo: password required` would be in `watcher-cron.log` (which I haven't read here).
|
||||
|
||||
### `/etc/cron.d/`
|
||||
|
||||
Stock OS files only: `certbot`, `e2scrub_all`, `sysstat`, plus the standard `cron.daily`/`cron.weekly`/`cron.hourly` directories with default Ubuntu cron jobs (`apport`, `apt-compat`, `dpkg`, `logrotate`, `man-db`, `sysstat`). **No aaronai-specific entries** in `/etc/cron.d/` or anywhere outside the user crontab.
|
||||
|
||||
`/etc/anacrontab` is not present.
|
||||
|
||||
Root crontab not inspected (sudo required; not granted in this read-only inventory pass).
|
||||
|
||||
### APScheduler jobs in `api.py`
|
||||
|
||||
`api.py:reschedule_jobs()` (line 1137) configures two jobs against an in-process `BackgroundScheduler`. The scheduler starts in the FastAPI lifespan; jobs are re-registered any time settings that contain a schedule key are updated.
|
||||
|
||||
| Job ID | Trigger | Function | What it does |
|
||||
|---|---|---|---|
|
||||
| `dream_job` | Cron, `hour=settings.dream_hour_utc`, `minute=settings.dream_minute_utc`, `tz=UTC` (default 08:00) | `run_dream_job` (line 1107) | `subprocess.run([PYTHON, dream.py], timeout=600)` — invokes the dreamer with no arguments → defaults to full pipeline (NREM → Early REM → Late REM → Synthesis). |
|
||||
| `ingest_job` | Cron, `hour=settings.ingest_hour_utc`, `minute=settings.ingest_minute_utc`, `tz=UTC` (default 02:30) | `run_ingest_job` (line 1123) | `subprocess.run([PYTHON, ingest_conversations.py], timeout=300)`. |
|
||||
|
||||
Both `max_instances=1`, both `replace_existing=True`. Settings changes that touch the schedule keys re-register the jobs.
|
||||
|
||||
**Behavior matches intent?** Mostly yes. The architecture's "Nightly Schedule" section says 02:30 UTC for conversation indexing and 08:00 UTC for the dream pipeline; both match. **One divergence:** `run_dream_job` uses `subprocess.run` (synchronous, with a 600s timeout). For a normal full-pipeline run this is enough, but Phase 5 of the reframe / E6 work would want longer runs — this is a soft cap nobody has hit yet. Architecture doc doesn't specify; flagging in case future longer runs need a bump.
|
||||
|
||||
**Notes:** The 600s `subprocess.run` timeout is the only thing protecting the FastAPI process from a stuck dreamer. If the dreamer hangs (e.g., Anthropic API stall), the scheduler thread holds for 10 minutes before the timeout fires. Acceptable but worth knowing.
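The timeout behavior described above is standard `subprocess.run` semantics: when the timeout expires the child is killed and `subprocess.TimeoutExpired` is raised in the caller. Here is a minimal sketch of a wrapper that reports the outcome. `run_with_timeout` and its return strings are hypothetical, and this inventory does not record how `run_dream_job` handles the exception.

```python
import subprocess
import sys


def run_with_timeout(command: list[str], timeout_seconds: float) -> str:
    """Run a command synchronously and report how it ended."""
    try:
        completed = subprocess.run(
            command, timeout=timeout_seconds, capture_output=True, text=True
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so no orphaned process is left behind at this point.
        return "timeout"
    return "ok" if completed.returncode == 0 else f"exit {completed.returncode}"


if __name__ == "__main__":
    # A child that sleeps longer than the limit is cut off.
    print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 0.5))
```

The calling thread is blocked for the whole duration either way, which is the 10-minute hold the note above describes.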
|
||||
|
||||
### Systemd timers
|
||||
|
||||
Already documented in Phase 2: three timer files exist (`aaronai-dreamer.timer`, `aaronai-index-conversations.timer`, `aaronai-maintenance.timer`), **none of them enabled** (none in `/etc/systemd/system/timers.target.wants/`). The dreamer and index-conversations timers duplicate APScheduler; the maintenance timer points at a broken service. APScheduler is the actual driver for the two paths the dreamer/ingest timers would cover.
|
||||
|
||||
### What is *not* scheduled
|
||||
|
||||
The architecture and reframe documents reference several mechanisms that have no scheduled runner today:
|
||||
- **Asynchronous dreamer pruning pass** (per reframe). Designed but unimplemented; no schedule.
|
||||
- **Consolidator 0.1 alias resolution.** The script exists, has no schedule, was always run by hand. Track 1 will dissolve it.
|
||||
- **`corpus_integrity.py` reconciliation.** Designed to be runnable on demand or via the SettingsPanel. No automated weekly run; the 129 unreadable files have been sitting at zero `retry_count` because OCR (priority 21b) has not shipped.
|
||||
- **`tier1_migration.py`** has no schedule (one-shot, already complete).
|
||||
|
||||
---
|
||||
|
||||
### Phase 5 summary
|
||||
|
||||
**Working and matching intent:**
|
||||
- User crontab (backup + watcher heartbeat watchdog).
|
||||
- APScheduler jobs (dream + ingest_conversations) match the architecture doc's nightly schedule.
|
||||
|
||||
**Working with behavior-vs-intent divergences:**
|
||||
- The watcher-restart cron uses `sudo systemctl restart aaronai-watcher`, but the only sudoers entry for aaron is for ollama and aaronai-graphiti. The line either depends on a sudoers entry not documented in the session brief, or fails silently. **Worth verifying as part of Track 1.**
|
||||
- `dream_job` uses 600s `subprocess.run` timeout — soft cap nobody has hit, but tightens the operational envelope for any future longer-running dream work.
|
||||
|
||||
**Stopped / dormant:**
|
||||
- All three `aaronai-*.timer` units (Phase 2). They are configured, not enabled, and overlap APScheduler.
|
||||
|
||||
**Removal candidates (do not remove):**
|
||||
- The three `aaronai-*.timer` files.
|
||||
|
||||
**NREM-shaped divergences in Phase 5:**
|
||||
1. **Watcher-restart sudo path.** The cron entry was probably added on the assumption that `aaron` had broad NOPASSWD sudo for systemctl, which the 2026-05-01 sudoers fix narrowed to specific commands. If the `aaronai-watcher` restart isn't in sudoers, the watchdog has been silently failing. Either way, this is the same shape: a recovery mechanism that is configured and looks like it works, but may not. The session brief and the architecture doc didn't cross-check it.
|
||||
2. **Two parallel scheduling stacks.** APScheduler in api.py drives nightly work; three systemd `.timer` files exist but are not enabled. The duplication makes "what triggers a dream" harder to answer than it should be.
|
||||
|
||||
---
|
||||
|
||||
|
||||
|
||||
---
|
||||
|
||||
## Phase 6 — Frontend routes
|
||||
|
||||
Next.js app router under `~/aaronai-web/app/`. Three user-facing routes plus a catch-all API proxy.
|
||||
|
||||
| Route | File | Auth | What it does | Backend support? |
|
||||
|---|---|---|---|---|
|
||||
| `/` | `app/page.tsx` | Required (cookie redirect to `/login`) | Main chat UI, sidebar, settings panel, dreamer status, corpus integrity status. | Yes: every backend `/api/*` endpoint is proxied through the catch-all. |
|
||||
| `/login` | `app/login/page.tsx` | None | Password login, sets `aaronai_session` cookie. | Yes — `POST /auth/login`. |
|
||||
| `/capture` | `app/capture/page.tsx` | None (mobile field-recorder, public) | Voice + image capture, posts to `/api/capture`. SSE listener on `/api/captures/events`. | Yes. |
|
||||
| `/api/[...slug]` | `app/api/[...slug]/route.ts` | Pass-through | Catch-all proxy: forwards every request to `${API_URL || 'https://ai.aaronnelson.studio'}/api/<slug>` (or `/<slug>` for `auth/*`). Forwards `cookie`, `content-type`, `set-cookie`. | Always — it is the proxy. |
|
||||
|
||||
That is the entire route surface. The frontend has no static `/dreams`, `/journal`, `/admin`, etc.; all dream output is delivered via Nextcloud and read out-of-band. The only data path between frontend and Aaron is chat, capture, and the SettingsPanel embedded in `/`.
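The catch-all proxy's path mapping from the route table can be summarized in a few lines. The real route is TypeScript (`app/api/[...slug]/route.ts`) and was not deep-read in this pass, so the following is a Python sketch of the documented rule only. `upstream_url` is a hypothetical name, and query-string and header forwarding are left out.

```python
from typing import Optional

# Fallback base URL, as documented for the proxy when API_URL is unset.
DEFAULT_API_URL = "https://ai.aaronnelson.studio"


def upstream_url(slug: list[str], api_url: Optional[str] = None) -> str:
    """Map the catch-all slug segments to the backend URL.

    `auth/*` paths go to `/<slug>`; everything else goes to `/api/<slug>`.
    """
    base = (api_url or DEFAULT_API_URL).rstrip("/")
    path = "/".join(slug)
    if slug and slug[0] == "auth":
        return f"{base}/{path}"
    return f"{base}/api/{path}"
```

For example, a request to `/api/auth/login` on the frontend would resolve to `/auth/login` on the backend, while `/api/chat` stays under `/api/`.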
|
||||
|
||||
**Behavior matches intent?** Yes against the architecture doc's Layer 3 list ("Login/logout … Chat desktop and mobile … Sidebar … Voice: tap-to-toggle … `/capture` voice + image"). The doc's "Not yet built" entries (Consolidation agent UI, drag-and-drop capture, LLM provider selector) are correctly absent.
|
||||
|
||||
**Notes:**
|
||||
- The catch-all proxy uses `process.env.API_URL` and falls back to `'https://ai.aaronnelson.studio'`. In production this is fine because the frontend talks back through the public domain (which nginx routes back to the same machine). Architecturally a bit roundabout (frontend → public DNS → nginx → backend on same host) but the deploy is consistent with what's documented.
|
||||
- I did not deep-read the route components or the `components/` directory — per Phase 6 scope ("don't go deep").
|
||||
|
||||
### Phase 6 summary
|
||||
|
||||
**Working and matching intent:** Three routes, all backed.
|
||||
|
||||
**Removal candidates:** None at this layer.
|
||||
|
||||
**NREM-shaped divergences:** None observed at the route level. (Component-level divergences would require deeper inspection.)
|
||||
|
||||
---
|
||||
|
||||
|
||||
@@ -0,0 +1,105 @@
|
||||
# OCR install record — 2026-05-04
|
||||
|
||||
## Machine
|
||||
|
||||
- Host: aaronai-01 (VPS)
|
||||
- OS: Ubuntu 24.04 noble (kernel 6.8.0-110-generic, x86_64)
|
||||
|
||||
## apt packages installed
|
||||
|
||||
| package | version | source |
|
||||
|---|---|---|
|
||||
| tesseract-ocr | 5.3.4-1build5 | noble |
|
||||
| tesseract-ocr-eng | 1:4.1.0-2 | noble |
|
||||
| tesseract-ocr-osd | 1:4.1.0-2 | noble (automatic) |
|
||||
| libtesseract5 | 5.3.4-1build5 | noble (automatic) |
|
||||
|
||||
## pip packages installed (into /home/aaron/aaronai/venv)
|
||||
|
||||
| package | version |
|
||||
|---|---|
|
||||
| pytesseract | 0.3.13 |
|
||||
| ocrmypdf | 17.4.2 |
|
||||
|
||||
Direct dependencies pulled in by the two installs above (also new in venv): `pikepdf 10.5.1`, `pdfminer-six 20260107`, `pypdfium2 5.7.1`, `img2pdf 0.6.3`, `pi-heif 1.3.0`, `cryptography 47.0.0`, `cffi 2.0.0`, `pycparser 3.0`, `Deprecated 1.3.1`, `deprecation 2.1.0`, `defusedxml 0.7.1`, `fonttools 4.62.1`, `fpdf2 2.8.7`, `uharfbuzz 0.54.1`, `wrapt 2.1.2`, `pluggy 1.6.0`. `pillow` was already at 12.2.0.
|
||||
|
||||
## Smoke test 1 — `tesseract --version`
|
||||
|
||||
```
tesseract 5.3.4
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
 Found AVX512BW
 Found AVX512F
```
|
||||
|
||||
## Smoke test 2 — `tesseract --list-langs`
|
||||
|
||||
```
List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (2):
eng
osd
```
|
||||
|
||||
## Smoke test 3 — pytesseract on a slide image
|
||||
|
||||
- Input pptx: `/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF555 3D Computational/GH Slicer Notes.pptx`
|
||||
- Extracted image: `ppt/media/image1.PNG` (1768×504 PNG)
|
||||
- Wall-clock: 0.220s
|
||||
- Chars extracted: 126
|
||||
- First 200 chars:
|
||||
|
||||
```
Generates the Bounding Box for NESS

round(x, 4), round(y, 4), round(z, 4), round(a, 4))

Format ("HSS5 X(0} ¥(1} W(2} H(3)",
```
|
||||
|
||||
Note: the first image in `Renders.pptx` (image1.jpg, 640×480) returned 0 chars on the first attempt. I then sampled 15 images in `Renders.pptx`; all 15 are pure rendered designs or photographs with no text. I switched to `GH Slicer Notes.pptx` (per the original candidate list of 4 image-only pptx files), where image1.PNG is a screenshot of code. Tesseract behavior is correct in both cases; `Renders.pptx` is not a useful OCR test target because it contains no text. The code screenshot shows some character-recognition noise (e.g. `¥(1}` for `Y(1)`, with parentheses and braces misread). That is acceptable for a baseline smoke test; production tuning is a worker-design concern.
|
||||
|
||||
## Smoke test 4 — ocrmypdf on a Lexmark CX510de scan
|
||||
|
||||
- Input PDF: `/home/aaron/nextcloud/data/data/aaron/files/Admin/Dossier/Tenure/Dossier Scan 2022/image2022-01-07-133846 - CAryn.pdf` (4 pages, Producer: Lexmark CX510de, Creator: HardCopy)
|
||||
- Command: `ocrmypdf --skip-text -l eng <input> /tmp/ocr_smoke/caryn_ocred.pdf`
|
||||
- Wall-clock: 3.72s (whole PDF, 4 pages)
|
||||
- Exit: 0
|
||||
- After OCR, `pdftotext` on the output produced 2347 chars (2270 non-whitespace).
|
||||
- First 200 chars of OCR'd text:
|
||||
|
||||
```
nN New Paltz
STATE UNIVERSITY OF NEW YORK

The Honors Program

May 30, 2017

Dear Aaron,

Thank you for serving as a reader for Caryn Byllott’s thesis on "Recall/Reconstruct: The Exploration of
Memory
```
|
||||
|
||||
Real readable English. The "nN" header is the Lexmark logo glyph; otherwise clean. ~0.93s/page on this scan, which is the reference number for sizing the async worker queue.
|
||||
|
||||
## Reference timing
|
||||
|
||||
| operation | input size | wall-clock |
|---|---|---|
| pytesseract single image | 1768×504 PNG | 0.22s |
| ocrmypdf 4-page scan | 4 pages, ~A4 | 3.72s (~0.93s/page) |
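
The per-page figure can be turned into a rough backlog estimate. The sketch below is only an illustration: it assumes OCR time scales linearly with page count at the measured rate and that work divides evenly across workers. The function name, the `workers` parameter, and the example page counts are hypothetical and do not come from the project.

```python
SECONDS_PER_PAGE = 3.72 / 4  # measured above: 4-page Lexmark scan in 3.72 s


def estimate_ocr_seconds(page_counts, workers=1, seconds_per_page=SECONDS_PER_PAGE):
    """Rough wall-clock estimate for OCR-ing a batch of PDFs.

    Assumes time is linear in pages and that work divides evenly
    across workers; both are simplifications.
    """
    if workers < 1:
        raise ValueError("workers must be >= 1")
    total_pages = sum(page_counts)
    return total_pages * seconds_per_page / workers


# Hypothetical backlog: these page counts are made up for illustration only.
example_backlog = [4, 12, 1, 30]
print(f"{estimate_ocr_seconds(example_backlog):.1f} s on one worker")
```

Real scans will vary with resolution and page content, so a single 4-page measurement should be treated as a starting point rather than a guarantee.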
|
||||
|
||||
## Deferred — project dep-tracking
|
||||
|
||||
The project has no dependency manifest on disk: no `requirements.txt`, `pyproject.toml`, `setup.py`, `Pipfile`, or `poetry.lock`. Pip deps live only in `venv/`. The OCR install adds `pytesseract` and `ocrmypdf` (plus their transitive closure listed above) to that untracked venv state.
|
||||
|
||||
This commit does not introduce a manifest. Tracking the dep-manifest decision as its own followup; the natural deadline is the capture-path integration commit, where `import pytesseract` will become load-bearing in the repo. If the manifest question is unresolved by then, that integration commit is the right place to address it.
|
||||
|
||||
## Followups
|
||||
|
||||
- Async OCR worker (separate session). Use the reference timing above to size the queue.
|
||||
- Capture path integration: phone-camera images → `pytesseract.image_to_string` → existing chunk/embed pipeline.
|
||||
- Backlog processing of 75 scanned PDFs (Lexmark CX510de and similar) and the 4 image-only pptx (`Renders.pptx`, `Ribbon Cutting Slideshow.pptx`, two `GH Slicer Notes` variants). Per the smoke results, `Renders.pptx` is unlikely to yield useful OCR text — it is rendered-design content, not scanned documents — and may instead need exclusion rather than processing.
|
||||
- Project dep-manifest decision (see Deferred section above).
|
||||
@@ -0,0 +1,194 @@
|
||||
# scripts/ reorganization plan — 2026-05-02
|
||||
|
||||
*Track 1 Bucket B fix #4 — read-only proposal. Nothing moved or deleted yet. Approve before executing.*
|
||||
|
||||
## Summary
|
||||
|
||||
The `~/aaronai/scripts/` directory currently holds **41** `.py`/`.sh` files. From the listing alone, it is hard to tell which files are live workers and which are completed-experiment artifacts. The proposed split:
|
||||
|
||||
| Bucket | Count | Destination |
|---|---|---|
| Production (stay) | 11 | `scripts/` |
| Experimental (move) | 28 | `scripts/experiments/` (already exists, holds 4 files; will hold 32) |
| Deprecated (move) | 2 | `scripts/deprecated/` (new) |
| `.bak*` to delete | 19 | git history is the durable record |
| Uncertain | 0 | n/a |
|
||||
|
||||
After execution, `ls scripts/*.py scripts/*.sh` should return only the 11 production files. `scripts/` itself will also contain the two subdirectories, `experiments/` and `deprecated/`, which that glob does not list.
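
The post-execution check can also be scripted. The following is a minimal sketch, assuming the 11 production filenames listed in section A; the function name `check_scripts_dir` is hypothetical.

```python
from pathlib import Path

# The 11 production files from section A of this plan.
PRODUCTION = {
    "api.py", "dream.py", "watcher.py", "stage2_worker.py", "stage3_worker.py",
    "graphiti_service.py", "ingest.py", "ingest_conversations.py",
    "corpus_integrity.py", "st_embedder.py", "backup.sh",
}


def check_scripts_dir(scripts_dir):
    """Return (unexpected, missing) top-level .py/.sh files in scripts_dir.

    Subdirectories and files with other suffixes (such as .bak) are ignored,
    which mirrors the `*.py` / `*.sh` globs used in the plan.
    """
    root = Path(scripts_dir)
    present = {
        p.name for p in root.iterdir()
        if p.is_file() and p.suffix in {".py", ".sh"}
    }
    return sorted(present - PRODUCTION), sorted(PRODUCTION - present)
```

Calling `check_scripts_dir(Path.home() / "aaronai" / "scripts")` after the moves should return two empty lists if the plan was applied as written.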
|
||||
|
||||
## Reference checks performed
|
||||
|
||||
Before producing this plan I grepped:
|
||||
- `subprocess` calls inside `api.py` for paths under `scripts/`
|
||||
- `import` and string-path references inside every production script
|
||||
- `ExecStart=` lines across every `aaronai-*.service` in `/etc/systemd/system/`
|
||||
- The user crontab for any line invoking a `scripts/` path
|
||||
|
||||
**Findings:**
|
||||
- The only scripts referenced from `api.py` are `ingest.py` (line 43, `INGEST_SCRIPT`), `dream.py` (lines 661 and 1111), `ingest_conversations.py` (line 1127), and `corpus_integrity.py` (line 934, `CORPUS_INTEGRITY_SCRIPT`).
|
||||
- `api.py` (line 937) and `corpus_integrity.py` (line 29) reference the data file `~/aaronai/experiments/tier1_migration_state.json` — that path is the **state file** in `~/aaronai/experiments/`, not the script. Moving `tier1_migration.py` does not break either reader.
|
||||
- No production script imports or shells out to any experimental file.
|
||||
- All eight `aaronai-*.service` units' `ExecStart` lines point at production scripts only.
|
||||
- The user crontab references `backup.sh` and `aaronai-watcher` (a service) — no experimental files.
|
||||
|
||||
So the reorganization is safe at the reference level for every file in section B (experiments), C (deprecated), and D (delete). No moves change a runtime code path.
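
The reference check above was done with grep. As a repeatable alternative, the sketch below scans already-loaded text (for example the contents of `api.py`, the unit files, and the crontab) for script filenames. It is an illustration only: `find_references` and the sample inputs are hypothetical, and the word-boundary rule is an assumption about what should count as a reference.

```python
import re


def find_references(script_names, sources):
    """Map each script name to the labels of the sources that mention it.

    `sources` maps a label (e.g. "api.py", "crontab") to that file's text.
    A name only matches when it is not part of a longer filename, so
    "dream.py" does not match "dream.py.bak" or "e3_dream.py".
    """
    hits = {name: [] for name in script_names}
    for name in script_names:
        pattern = re.compile(r"(?<![\w.-])" + re.escape(name) + r"(?![\w.-])")
        for label, text in sources.items():
            if pattern.search(text):
                hits[name].append(label)
    return hits


# Made-up sample inputs, loosely shaped like the files that were grepped.
sample_sources = {
    "api.py": 'INGEST_SCRIPT = "scripts/ingest.py"\nrun(["python", "scripts/dream.py"])',
    "crontab": "0 3 * * * /home/aaron/aaronai/scripts/backup.sh",
}
sample_hits = find_references(
    ["ingest.py", "dream.py", "backup.sh", "cascade_test.py"], sample_sources
)
```

A script whose list comes back empty is a candidate for moving; any non-empty list names the files that would need checking first.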
|
||||
|
||||
---
|
||||
|
||||
## A — PRODUCTION (stay in `scripts/`)
|
||||
|
||||
These 11 files are constraint-locked or referenced by an active runtime mechanism. None moves.
|
||||
|
||||
| File | Why it stays |
|---|---|
| `api.py` | `aaronai.service` ExecStart; long-running FastAPI backend; APScheduler. |
| `dream.py` | `aaronai-dreamer.service` ExecStart; called by APScheduler in `api.py`; called by `/api/dreamer/run`. |
| `watcher.py` | `aaronai-watcher.service` ExecStart; Stage 1 of the encoding pipeline. |
| `stage2_worker.py` | `aaronai-stage2.service` ExecStart. |
| `stage3_worker.py` | `aaronai-stage3.service` ExecStart (service is currently stopped, but the unit is enabled and the file is the unit's ExecStart). |
| `graphiti_service.py` | `aaronai-graphiti.service` ExecStart. |
| `ingest.py` | `INGEST_SCRIPT` constant in `api.py`; `/api/reindex` shells out to it. |
| `ingest_conversations.py` | `aaronai-index-conversations.service` ExecStart **and** APScheduler `ingest_job` in `api.py`. |
| `corpus_integrity.py` | `CORPUS_INTEGRITY_SCRIPT` constant in `api.py`; `/api/corpus/reconcile` shells out to it. |
| `st_embedder.py` | Imported by `graphiti_service.py` at sidecar startup (`SentenceTransformerEmbedder`). |
| `backup.sh` | User crontab `0 3 * * *` daily snapshot of `memory.md`, `settings.json`, `conversations.db`. |
|
||||
|
||||
---
|
||||
|
||||
## B — MOVE TO `scripts/experiments/`
|
||||
|
||||
28 files. None is referenced by any production code, systemd unit, or cron job.
|
||||
|
||||
For brevity, the "Why" column gives the experiment identity — full per-file write-ups are in the inventory's Phase 1 experimental table. The "Referenced by" column is the result of the grep against api.py / systemd ExecStart lines / cron / production scripts; "(none in production)" means no production code references it.
|
||||
|
||||
| Current path | Action | Why | Referenced by |
|---|---|---|---|
| `scripts/audit_expansion_draw.py` | move → `scripts/experiments/` | Type-aware stratified draw for n=20 audit expansion (sample-construction tool for `base_class_audit_rerun.py`). | (none in production) |
| `scripts/base_class_test.py` | move → `scripts/experiments/` | Base-class enrichment OOP framing experiment, n=20. | (none in production) |
| `scripts/base_class_validation.py` | move → `scripts/experiments/` | Base-class enrichment validation, n=50. | (none in production) |
| `scripts/base_class_audit_rerun.py` | move → `scripts/experiments/` | Base-class n=8 paired-extraction audit. | (none in production) |
| `scripts/briefing_generator_v2.py` | move → `scripts/experiments/` | Experiment 002b — briefing v2; validated 96% Mistral structural pattern. | (none in production) |
| `scripts/briefing_test.py` | move → `scripts/experiments/` | Experiment 002 — briefing v1; superseded by v2. | (none in production) |
| `scripts/cascade_test.py` | move → `scripts/experiments/` | Entity-drafter cascade n=20; falsified. | (none in production) |
| `scripts/cascade_optimization_test.py` | move → `scripts/experiments/` | Optimized entity-drafter cascade n=30; confirmed entity-drafter cascade is dead. | (none in production) |
| `scripts/consistency_test.py` | move → `scripts/experiments/` | Experiment 001 — Mistral 3-pass consistency, n=50. | (none in production) |
| `scripts/consistency_test_v2.py` | move → `scripts/experiments/` | Experiment 003 — entity-only consistency with corrected sampling. | (none in production) |
| `scripts/cost_test_graphiti_bulk.py` | move → `scripts/experiments/` | Bulk endpoint cost test, n=50. | (none in production) |
| `scripts/cost_test_graphiti_bulk_retry.py` | move → `scripts/experiments/` | Retry of failed bulk batches (pre-MAX_QUEUED_QUERIES bump). | (none in production) |
| `scripts/cost_test_graphiti_bulk_retry2.py` | move → `scripts/experiments/` | Second retry attempt, smaller batches. | (none in production) |
| `scripts/cost_test_graphiti_migration.py` | move → `scripts/experiments/` | Single-episode migration cost test, n=50. | (none in production) |
| `scripts/e1_select_sample.py` | move → `scripts/experiments/` | E1 sample selection. | (none in production) |
| `scripts/e1_run_cascade.py` | move → `scripts/experiments/` | E1 cascade orchestration (initial). | (none in production) |
| `scripts/e1_run_cascade_corrected.py` | move → `scripts/experiments/` | E1 corrected (custom_extraction_instructions path). | (none in production) |
| `scripts/e1_per_source_predicates.py` | move → `scripts/experiments/` | E1 per-source predicate count, corrected metric. | (none in production) |
| `scripts/e1_compare_metrics.py` | move → `scripts/experiments/` | E1 A vs B metrics comparison. | (none in production) |
| `scripts/e14_select_sample.py` | move → `scripts/experiments/` | E1.4 stratified sample selection (n=30). | (none in production) |
| `scripts/e14_run_cascade.py` | move → `scripts/experiments/` | E1.4 cascade orchestration. | (none in production) |
| `scripts/e14_per_source_predicates.py` | move → `scripts/experiments/` | E1.4 per-source predicate diversity. | (none in production) |
| `scripts/e16_rate_purity.py` | move → `scripts/experiments/` | E1.6 domain-purity human rating UI; surfaced taxonomic-mismatch finding. | (none in production) |
| `scripts/e16_analyze.py` | move → `scripts/experiments/` | E1.6 Spearman correlation against E1.4. | (none in production) |
| `scripts/e2_resolution_check.py` | move → `scripts/experiments/` | E2 entity-resolution diagnostic on six test entities. | (none in production) |
| `scripts/e2_alias_followup.py` | move → `scripts/experiments/` | E2 alias follow-up (Aaron AI variants etc.). | (none in production) |
| `scripts/e2_source_diversity.py` | move → `scripts/experiments/` | E2 episode count per entity. | (none in production) |
| `scripts/token_measurement_test.py` | move → `scripts/experiments/` | Experiment 005 — token reduction measurement. | (none in production) |
|
||||
|
||||
`scripts/experiments/` already contains four files (`e1_8_eval.py`, `e1_8_taxfree_cascade.py`, `e1_9_retroactive.py`, `e3_dreamer_substrate.py`); after the move it holds 32. **No collisions** between current `scripts/` filenames and existing `scripts/experiments/` filenames — verified by the file lists.
|
||||
|
||||
---
|
||||
|
||||
## C — MOVE TO `scripts/deprecated/`
|
||||
|
||||
Two files. New directory `scripts/deprecated/` is created. Per the user constraint on tier1, both are flagged.
|
||||
|
||||
| Current path | Action | Why | Referenced by |
|---|---|---|---|
| `scripts/consolidator_v0_1.py` | move → `scripts/deprecated/` | The reframe doc explicitly identifies "consolidator-as-separate-system" as the architectural mistake (its function moves into the dream phase). The 0.1 calibration findings (2026-04-29) showed alias-resolution-from-graph-features-alone has structural problems on this corpus that threshold tuning cannot address. Bespoke decision dissolves the layer. | (none in production); `scripts/consolidator_v0_1.py.bak` is in section D. |
| `scripts/tier1_migration.py` | move → `scripts/deprecated/` | One-shot completed 2026-04-30 (1,205 sources, 4,990 nodes, 22,289 edges). Under the bespoke decision the substrate this migrated **to** is being replaced; re-running the script against the bespoke substrate would not be the right move. **Flag (per Tier1 constraint):** the script's state file at `~/aaronai/experiments/tier1_migration_state.json` IS still consumed — `corpus_integrity.py:29` and `api.py:937` read it for the "graphiti coverage" report. **Moving the script does not affect the state file** (the state file lives in `~/aaronai/experiments/`, not `~/aaronai/scripts/`). The reader-vs-writer separation makes this safe. | (none in production); state file `~/aaronai/experiments/tier1_migration_state.json` consumed by `corpus_integrity.py` + `api.py`, not the script itself |
|
||||
|
||||
---
|
||||
|
||||
## D — DELETE (`.bak*` files)
|
||||
|
||||
19 files. Git history is the durable record of every prior version. Removing `.bak*` files is a cleanup, not a loss.
|
||||
|
||||
For each: action is `rm`. None is referenced by any production path.
|
||||
|
||||
| File | Approximate purpose |
|---|---|
| `scripts/api.py.bak.20260501-001427` | Pre-CV-pinning-strip / pre-F1 snapshot. |
| `scripts/consolidator_v0_1.py.bak` | Pre-0.1.5-patch (Jaccard, before containment metric). |
| `scripts/corpus_integrity.py.bak.20260501-021703` | Pre-F14 truncation snapshot. |
| `scripts/dream.py.bak` | Older dreamer (pre v1.1 score-band). |
| `scripts/dream.py.bak.20260501-002209` | Pre-F1 dreamer. |
| `scripts/graphiti_service.py.bak` | Pre-bulk-saga sidecar. |
| `scripts/graphiti_service.py.bak.20260501-185619` | Mid-rollback snapshot. |
| `scripts/graphiti_service.py.bak.20260502-022307` | Mid-rollback snapshot (rolled-back work). |
| `scripts/ingest.py.bak.20260501-004131` | Pre-F14 truncation snapshot. |
| `scripts/stage2_worker.py.bak.20260501-171928` | v2.0 → v2.1 transition. |
| `scripts/stage2_worker.py.bak.20260501-172531` | v2.1 patch step. |
| `scripts/stage2_worker.py.bak.20260501-185942` | v2.1 patch step. |
| `scripts/stage3_worker.py.bak.20260501-050354` | Pre-saga-split. |
| `scripts/stage3_worker.py.bak.20260501-050453` | Pre-saga-split. |
| `scripts/stage3_worker.py.bak.20260501-050719` | Pre-saga-split. |
| `scripts/stage3_worker.py.bak.20260501-173233` | Mid-v2.1. |
| `scripts/stage3_worker.py.bak.20260501-190357` | v2.1 final. |
| `scripts/watcher.py.bak` | Pre-in-process refactor (2026-04-30). |
| `scripts/watcher.py.bak.20260501-004131` | Pre-F14 truncation snapshot. |
|
||||
|
||||
Stage 3 alone has five `.bak` versions; Stage 2 has three. Both are visible in `git log` for the corresponding production files — no information is lost.
|
||||
|
||||
---
|
||||
|
||||
## E — UNCERTAIN
|
||||
|
||||
None. Every file in `scripts/` is classified above. The grep against api.py / systemd / cron / production scripts produced clean answers for each.
|
||||
|
||||
The `scripts/__pycache__/` directory exists and contains `.pyc` for `api`, `corpus_integrity`, `dream`, `ingest`, `stage3_worker`, `st_embedder`, `watcher` (notably no `.pyc` for `stage2_worker.py`). Not part of this plan, but Python regenerates `.pyc` on next import — `__pycache__/` is safe to remove at any time and has no bearing on the moves above. **Recommended but not in this plan: `rm -rf scripts/__pycache__/` after the moves complete, so stale entries for moved files don't linger.**
|
||||
|
||||
---
|
||||
|
||||
## Execution-step preview (NOT executed in this turn)
|
||||
|
||||
For when the plan is approved, the proposed mechanic is:
|
||||
|
||||
```bash
cd ~/aaronai
mkdir -p scripts/deprecated/

# Section B: 28 moves to scripts/experiments/
# Each brace list stays on one line with no spaces, because whitespace or a
# line continuation inside the braces would prevent brace expansion.
git mv \
  scripts/{audit_expansion_draw,base_class_test,base_class_validation,base_class_audit_rerun}.py \
  scripts/{briefing_generator_v2,briefing_test}.py \
  scripts/{cascade_test,cascade_optimization_test}.py \
  scripts/{consistency_test,consistency_test_v2}.py \
  scripts/{cost_test_graphiti_bulk,cost_test_graphiti_bulk_retry,cost_test_graphiti_bulk_retry2,cost_test_graphiti_migration}.py \
  scripts/{e1_select_sample,e1_run_cascade,e1_run_cascade_corrected,e1_per_source_predicates,e1_compare_metrics}.py \
  scripts/{e14_select_sample,e14_run_cascade,e14_per_source_predicates}.py \
  scripts/{e16_rate_purity,e16_analyze}.py \
  scripts/{e2_resolution_check,e2_alias_followup,e2_source_diversity}.py \
  scripts/token_measurement_test.py \
  scripts/experiments/

# Section C: 2 moves to scripts/deprecated/
git mv scripts/consolidator_v0_1.py scripts/tier1_migration.py scripts/deprecated/

# Section D: 19 deletes
rm scripts/*.bak*

# Section E recommendation (post-move)
rm -rf scripts/__pycache__/
```
|
||||
|
||||
`git mv` stages each move so that Git detects it as a rename, which keeps per-file history reachable with `git log --follow`. After execution, a single commit with a body listing each move and delete (no Co-Authored-By trailer) would land the change.
|
||||
|
||||
---
|
||||
|
||||
## What this plan does NOT do
|
||||
|
||||
- Does not modify `api.py`, `corpus_integrity.py`, `tier1_migration.py`, or any other code. The `MIGRATION_STATE` path in `corpus_integrity.py:29` and the matching constant in `api.py:937` continue to point at `~/aaronai/experiments/tier1_migration_state.json` — unchanged by the move.
|
||||
- Does not modify any systemd unit. Every `ExecStart` continues to point at a `scripts/<production>.py` path that remains valid.
|
||||
- Does not touch the user crontab.
|
||||
- Does not touch `~/aaronai/db/` (separate decision flagged in inventory; ChromaDB-era 550M directory).
|
||||
- Does not delete `scripts/__pycache__/` (recommendation only).
|
||||
- Does not touch the four files already in `scripts/experiments/` (`e1_8_eval.py`, `e1_8_taxfree_cascade.py`, `e1_9_retroactive.py`, `e3_dreamer_substrate.py`).
|
||||
|
||||
## Awaiting approval
|
||||
|
||||
Tell me to proceed and I will execute Sections B → C → D in order, then run `git status` and `git diff --stat` so you can review before the commit. No commit will be made until you give the second go-ahead.
|
||||
@@ -0,0 +1,175 @@
|
||||
# Stage 2 Frame Analysis — 2026-05-03
|
||||
|
||||
*Improvement #3 of three Track 1 improvements. Read-only report on the frame data Stage 2 produces, in service of Track 2 substrate design (Step 2.4 operation set spec).*
|
||||
|
||||
**Data source:** `stage_3_queue.stage2_metadata` (jsonb), exposed via the new SQL view `stage2_frames_v`. Analysis script: `scripts/experiments/frame_distribution_report.py`. Sidecar JSON: `experiments/frame_distribution_2026-05-03.json`. **Stage 3 service is currently stopped, so this is a stable snapshot.**
|
||||
|
||||
---
|
||||
|
||||
## Verdict
|
||||
|
||||
**Frames cluster meaningfully but coverage is partial.** Frame distribution is skewed (one frame, "Education", appears in 36% of frame-extracted docs) but not degenerate: the top 20 frames carry recognizable domain signal, file-type bins differentiate them further, and per-doc frame counts are healthy. **However, only 56% of the embeddings corpus has any frame data at all.** The other 44% has zero frame coverage, almost all of it by design rather than by accident: conversations bypass Stage 2, and files under 2,000 chars (which include every voice note and dream output) skip the Mistral call. Only about 1% of the corpus lacks frames because Stage 2 failed.
|
||||
|
||||
Frame-conditional routing is a viable γ component candidate **for the document side of the corpus**. It is not a viable router for the conversational or self-generated side without filling the coverage hole.
|
||||
|
||||
---
|
||||
|
||||
## 1. Corpus-wide frame coverage
|
||||
|
||||
| Class | Count | % of corpus | Frame coverage |
|---|---|---|---|
| Total distinct sources in `embeddings` | 1,255 | 100% | — |
| Files with frames (`stage_3_queue.stage2_metadata`) | 704 | 56.1% | yes |
| Conversations (Claude / ChatGPT / Aaron AI) | 198 | 15.8% | **none — bypass Stage 2 by design** |
| Files <2,000 chars (Stage 2 char-gate skip) | 339 | 27.0% | **none — Mistral never invoked** |
| Files that failed Stage 2 | 12 | 1.0% | none |
|
||||
|
||||
**56.1% frame coverage** is the headline. The architectural reason for the gap is twofold:
|
||||
|
||||
1. **`ingest_conversations.py` writes directly to `embeddings`** with `type='aaronai_conversation'` and never enqueues to `stage_2_queue`. Conversations have never been frame-extracted, full stop.
|
||||
2. **`stage2_worker.py:139` gates Mistral on char_length.** Docs <2,000 chars are marked complete with `completed_at = NOW()` *before* Mistral runs. The Mistral cost is not paid for these (correction to my earlier framing in the inventory) — but neither is any frame data produced.
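
The gate described in item 2 can be pictured as follows. This is a hedged sketch of the behavior as reported here, not the code in `stage2_worker.py`: the function name, its signature, and the `extract_frames` callable (standing in for the Mistral call) are all hypothetical.

```python
CHAR_GATE = 2_000  # threshold reported for the Stage 2 char-length gate


def process_document(text, extract_frames):
    """Return Stage 2 metadata, or None when the char gate skips the doc.

    `extract_frames` is a placeholder for the model call; it is only
    invoked for documents at or above the gate, so skipped documents
    cost nothing but also get no frame data.
    """
    if len(text) < CHAR_GATE:
        return None  # treated as complete without any frames
    return extract_frames(text)
```

Under this reading, a 1,999-character voice note never reaches the model, which matches the observation that short documents carry no frames.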
|
||||
|
||||
## 2. Frame distribution (the docs that DO have frames)
|
||||
|
||||
**668 docs, 1,374 distinct frame labels. Top-20 by count:**
|
||||
|
||||
| Frame | Count | % of frame-extracted docs |
|---|---|---|
| Education | 238 | 35.6% |
| Course | 58 | 8.7% |
| Programming | 43 | 6.4% |
| Design | 32 | 4.8% |
| Professional Experience | 24 | 3.6% |
| Employment | 24 | 3.6% |
| Research | 23 | 3.4% |
| 3D Printing | 22 | 3.3% |
| Project, Grading, Art, Budget | 21 each | 3.1% |
| Academic Integrity | 20 | 3.0% |
| Teaching, Technology, Attendance, Application | 13–19 | — |
| Accommodation, Manufacturing, Coursework, Recommendation | 10–13 | — |
|
||||
|
||||
**Per-doc frame count:** median 3–4 frames per doc; 76% of docs have 3–5 frames; one outlier doc has 30 frames (Mistral over-segmented).
|
||||
|
||||
**Long tail is enormous.** 1,374 distinct labels for 668 docs means most labels appear once. Mistral is producing a near-open vocabulary, not a clean taxonomy.
|
||||
|
||||
**"Education" is the universal frame.** It dominates co-occurrence pairs (8 of the top-10 pairs include Education). Education functions as a near-tautology for this corpus and carries less discriminating signal than narrower frames like "Programming" or "3D Printing."
|
||||
|
||||
## 3. Label hygiene
|
||||
|
||||
**54 normalized collisions** detected (case-insensitive, underscore-vs-space):
|
||||
|
||||
| Concept | Variant counts |
|---|---|
| Professional Experience | `Professional Experience`:24 + `Professional_Experience`:6 |
| 3D Printing | `3D Printing`:22 + `3D_Printing`:7 |
| Academic Integrity | `Academic Integrity`:20 + `Academic_Integrity`:2 |
| Course Design | `Course Design`:9 + `Course_Design`:1 |
| Project Management | `Project Management`:7 + `Project_Management`:1 |
| Computational Design | `Computational Design`:7 + `Computational_Design`:1 |
| (… 48 more) | |
|
||||
|
||||
Without normalization, ~30+ documents have their frames silently split across spelling variants for the same concept. Any frame-conditional router must normalize before counting. Recommended canonical form: lowercase, single-space, hyphens preserved.
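
The recommended canonical form can be sketched in a few lines. This is a minimal illustration, assuming the only variants to fold are case, underscores, and repeated whitespace; the function names are hypothetical, and the counts are taken from the collision table.

```python
import re
from collections import Counter


def normalize_frame_label(label):
    """Lowercase, turn underscores into spaces, collapse whitespace; keep hyphens."""
    spaced = label.replace("_", " ")
    return re.sub(r"\s+", " ", spaced).strip().lower()


def merge_counts(raw_counts):
    """Sum the counts of raw labels that share a normalized form."""
    merged = Counter()
    for label, count in raw_counts.items():
        merged[normalize_frame_label(label)] += count
    return merged


# Raw counts copied from the collision table.
raw = {
    "3D Printing": 22, "3D_Printing": 7,
    "Professional Experience": 24, "Professional_Experience": 6,
}
merged = merge_counts(raw)
```

After merging, each concept has a single key, so a router counting frames sees one "3d printing" entry rather than two spelling variants.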
|
||||
|
||||
## 4. Worker version drift
|
||||
|
||||
| Worker version | Doc count | Notes |
|---|---|---|
| v2.1 | 665 | Two ad-hoc-key intrusions: `academic_details` (1 doc), `additional_information` (1 doc). Mistral occasionally invents extra structured keys not in the prompt schema. |
| v2.0 | 3 | Same key shape as v2.1 baseline. |
|
||||
|
||||
Schema is stable across the version transition for this dataset. The ad-hoc keys are a Mistral quirk (instruction-following variance), not a worker bug. **For Track 2 substrate ingest, plan for `stage2_metadata` to occasionally include unexpected top-level keys.**
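
One way to plan for stray keys is to separate them at ingest instead of rejecting the row. The sketch below is illustrative only: the report does not list the prompt schema's key names, so `EXPECTED_KEYS` is a hypothetical placeholder and `split_metadata` is an invented helper.

```python
# Hypothetical schema: the real prompt schema's key names are not given in this report.
EXPECTED_KEYS = frozenset({"frames", "entities", "summary"})


def split_metadata(metadata, expected=EXPECTED_KEYS):
    """Separate schema keys from ad-hoc keys the model invented.

    Returns (known, extra) so that a consumer can use `known` directly
    and log or store `extra` without failing on it.
    """
    known = {k: v for k, v in metadata.items() if k in expected}
    extra = {k: v for k, v in metadata.items() if k not in expected}
    return known, extra


# Example row with one of the ad-hoc keys observed in the v2.1 data.
known, extra = split_metadata(
    {"frames": ["Education"], "academic_details": {"term": "unknown"}}
)
```

Keeping the extras rather than dropping them preserves the evidence if the ad-hoc keys later turn out to be worth analysing.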
|
||||
|
||||
## 5. File-type signal
|
||||
|
||||
This is the most useful Track 2 finding from this report.
|
||||
|
||||
`stage_3_queue.source` stores bare filenames, so I bin by file-type suffix. Frames stratify cleanly:
|
||||
|
||||
| Frame | pdf | docx | pptx | markdown | txt | dream |
|---|---|---|---|---|---|---|
| Education | 116 | 119 | 3 | — | — | — |
| Course | 29 | 29 | — | — | — | — |
| Programming | 12 | 10 | **15** | — | 6 | — |
| Application | **13** | 2 | — | — | — | — |
| 3D Printing | 11 | 3 | **8** | — | — | — |
| Manufacturing | 3 | 6 | 4 | — | — | — |
| Research | 9 | 13 | — | 1 | — | — |
|
||||
|
||||
**Concrete signal:** "Programming" pivots toward pptx (slide decks), "Application" pivots toward pdf (compiled PDFs), Education spreads across pdf+docx (syllabi and dossiers). File type is essentially free signal — the watcher already knows it — and it disambiguates frames that the model treats as equivalent. **`embeddings.type` is currently NULL for 71% of rows per inventory finding 5; backfilling that field (Improvement #2) makes file-type signal actually queryable instead of reverse-engineerable from filenames.**
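
To make the pivot concrete, the cross-tab rows can be converted into per-frame shares. The sketch below uses counts copied from the table (blank cells omitted); the helper name `file_type_shares` is hypothetical, and the shares describe this snapshot only.

```python
# Frame x file-type counts copied from the cross-tab (blank cells omitted).
CROSS_TAB = {
    "Education": {"pdf": 116, "docx": 119, "pptx": 3},
    "Programming": {"pdf": 12, "docx": 10, "pptx": 15, "txt": 6},
    "Application": {"pdf": 13, "docx": 2},
    "3D Printing": {"pdf": 11, "docx": 3, "pptx": 8},
}


def file_type_shares(counts):
    """Convert one frame's per-type counts into fractions that sum to 1."""
    total = sum(counts.values())
    return {ftype: n / total for ftype, n in counts.items()}


for frame, counts in CROSS_TAB.items():
    shares = file_type_shares(counts)
    top = max(shares, key=shares.get)
    print(f"{frame}: most common type is {top} ({shares[top]:.0%} of docs)")
```

With cell counts this small, the shares are indicative rather than stable, which is consistent with the validation caveat in §7.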
|
||||
|
||||
## 6. Systematic exclusions inside the 339-doc gap
|
||||
|
||||
Of the 339 short docs that bypass frame extraction, the breakdown by file type:
|
||||
|
||||
| Type | Count | What this is |
|---|---|---|
| pdf | 110 | Short PDFs (forms, single-page docs) |
| docx | 110 | Short Word docs |
| dream_output | 39 | **The dreamer's own NREM/Early-REM/Late-REM/synthesis files** |
| pptx | 31 | Short slide decks |
| txt | 28 | Plain-text files |
| voice_note | 14 | **Every voice note in the corpus** |
| markdown | 7 | Short markdown |
|
||||
|
||||
**Two specific systematic exclusions worth naming separately:**
|
||||
|
||||
- **All 14 voice notes have no frames.** Voice is one of Aaron's primary capture channels. The frame system is silent on it.
|
||||
- **All 39 dream outputs have no frames.** The dreamer's writing is invisible to the frame system that orients the dreamer's own next pass. The system cannot frame-condition on its own output.
|
||||
|
||||
These are NREM-shape findings: the architecture's frame extraction is *quietly* not running on whole categories of input that the architecture treats as first-class. Recommended for the inventory.
|
||||
|
||||
---
|
||||
|
||||
## 7. Would frame-conditional routing be a viable γ component, and what would it condition on?
|
||||
|
||||
**Viable on the framed-doc subset, subject to validation on larger samples for §5 stratification.** The 56% of corpus with frames shows real distributional signal; the 44% gap is unrouted. Conditions for the framed-doc subset:
|
||||
|
||||
1. **Normalize labels before any routing decision.** 54 collision groups today; the router must operate on normalized canonical form, not raw Mistral output. Add a normalization layer between Mistral and any consumer.
|
||||
2. **Treat "Education" as a near-universal prior, not a frame.** It carries low routing signal because it's everywhere. Either drop it from the conditional, or use it as the *base case* and condition on the secondary frame. (See §8 follow-up — the dominance may be a Mistral prompt artifact rather than a corpus shape; cheap diagnostic available.)
|
||||
3. **Combine frames with file type, not frames alone.** Frame × file-type stratifies more cleanly than frame alone (see §5). The §5 cross-tab is suggestive — Programming → pptx (n=15), Application → pdf (n=13) — but cell counts are small and need validation on a larger sample before being load-bearing for substrate design.
|
||||
|
||||
**What it would condition on:** the joint of (normalized frame set, file type, doc length bucket). Concretely, a Track 2 router could compute `P(this doc is relevant to current goal | frames ∩ goal_frames, file_type, length)` rather than using a fixed cosine similarity threshold. Frames give the topic axis; file type gives the genre axis; length gives the granularity axis.
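
As a rough picture of the inputs such a router would see, the sketch below builds the three-axis feature tuple for one document. Everything here is an assumption made for illustration: the function names, the use of Jaccard overlap as the frame score, and the two-bucket length split at the 2,000-char gate. It does not estimate the conditional probability itself.

```python
def length_bucket(char_length, gate=2_000):
    """Two-bucket split at the Stage 2 char gate; finer buckets are a design choice."""
    return "short" if char_length < gate else "long"


def routing_features(doc_frames, goal_frames, file_type, char_length):
    """Build the (frame overlap, file type, length bucket) features for one doc.

    Frames are assumed to be normalized already. Overlap is Jaccard
    similarity, a stand-in for whatever scoring the router ends up using.
    """
    doc, goal = set(doc_frames), set(goal_frames)
    union = doc | goal
    overlap = len(doc & goal) / len(union) if union else 0.0
    return {
        "shared_frames": sorted(doc & goal),
        "frame_overlap": overlap,
        "file_type": file_type,
        "length_bucket": length_bucket(char_length),
    }


features = routing_features(
    {"programming", "3d printing"}, {"programming"}, "pptx", 5400
)
```

A learned or hand-tuned model could then map these features to a relevance score in place of a fixed cosine threshold.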
|
||||
|
||||
**Defined scope (the coverage caveat):**
|
||||
|
||||
The router only works on the 56% of corpus that has frames. To extend to the full corpus, Track 2 has three options:
|
||||
|
||||
- **(a) Backfill frames for short docs and conversations.** Run Mistral on the 339 short docs (cheap — they're short) and on the 198 conversations. This makes frames a corpus-wide signal at the cost of a one-time Mistral run.
|
||||
- **(b) Use a degraded fallback for unframed docs.** File-type signal is available for short files; conversation type is available for conversations. Route those by their available signal; route framed docs by frame+type.
|
||||
- **(c) Accept the gap as a scope limit.** The router only operates on long, non-conversation files. The 44% gap is unrouted (whatever the current default is).
|
||||
|
||||
(a) is the most general and the most aligned with the architecture's stated commitment ("Stage 2 produces orientation metadata for everything"). Mistral cost on the 537 unframed items (339 short docs plus 198 conversations) is small. **Recommend (a) before any router work begins.**
|
||||
|
||||
---
|
||||
|
||||
## 8. Recommended follow-ups (ordered by ROI)
|
||||
|
||||
1. **Backfill the 339 short docs.** Run a one-shot script that bypasses the char_length gate and runs Mistral on them. The voice notes and dream outputs are the highest priorities: they are the primary capture and primary self-reflection channels, and both are currently silent.
2. **Backfill conversations into frame extraction.** Either modify `ingest_conversations.py` to enqueue Stage 2, or run a one-shot conversation-frame extraction pass. This is the larger backfill (198 conversations, multiple chunks each) but it removes the conversational coverage hole.
3. **Add a frame-label normalizer at the worker.** New rows write a normalized canonical form alongside the raw Mistral output. Older rows can be normalized at query time via the view.
4. **Decide whether to deprecate "Education" as a frame.** It's so universal in this corpus that it adds noise. Either drop it from Mistral's prompt, or downweight it in any router that conditions on frames.
5. **Per-frame retrieval-similarity follow-up (deferred from this report).** Now that we know frames cluster meaningfully, instrumenting `dream.py` to record per-source similarity per stage becomes worthwhile. That would tell us whether retrieval already implicitly prefers certain frames.
6. **Diagnose the "Education" dominance: prompt artifact vs. corpus shape.** Education appears in 36% of frame-extracted docs. Two hypotheses: (a) Mistral's prompt biases toward institutional/academic framings (prompt artifact); (b) the corpus genuinely is dominated by academic/teaching content (corpus shape). Cheap diagnostic: hand-inspect 20 random docs tagged "Education" and classify each as *truly academic content* vs. *Education was a default Mistral reached for*. If the split is mostly (b), Education is honest signal and the router should treat it as a base case; if mostly (a), revise the Mistral prompt to discourage default tags. A 20-doc sample is small enough to do in one sitting and large enough to distinguish the hypotheses at >70/30 splits.
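
The sample-size claim in item 6 can be sanity-checked with exact binomial arithmetic. The sketch below assumes a decision rule the report does not state: call the hypothesis by strict majority of the 20 hand labels, and treat each label as an independent draw. The function name and the rule are illustrative.

```python
from math import comb


def prob_majority_correct(n: int, p: float) -> float:
    """P(more than n/2 of n independent labels land on the true-majority side).

    p is the true share of the majority class; ties count as "no call".
    """
    threshold = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(threshold, n + 1)
    )


if __name__ == "__main__":
    for p in (0.6, 0.7, 0.8):
        prob = prob_majority_correct(20, p)
        print(f"true split {p:.0%}: P(majority call is right) = {prob:.3f}")
```

Under these assumptions a 70/30 split is called correctly far more often than a 60/40 one, which is consistent with the ">70/30" qualifier; closer splits would need a larger sample.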
|
||||
|
||||
---
|
||||
|
||||
## 9. Inventory edits flagged for session-end batch
|
||||
|
||||
- **Correction:** `stage2_metadata` is a jsonb column on `stage_3_queue` (`stage_3_queue.stage2_metadata`), not on `stage_2_queue` as the inventory implied. The Phase 1 / `stage2_worker.py` entry should be corrected.
- **New finding:** the char_length gate runs *before* the Mistral call (`stage2_worker.py:139` precedes `:147`). For the 339 sub-2000-char docs, Mistral is never invoked. This reframes the architecture's "Stage 2 extracts orientation for everything" commitment.
- **New finding:** `ingest_conversations.py` bypasses Stage 2 entirely. 198 conversation sources have zero frame coverage by design. Same NREM shape as #1: a routing decision the architecture didn't explicitly request.
- **New finding (cross-link to #2):** `embeddings.type` NULL-rate findings now have a concrete read consumer. File-type signal would unlock the frame × file-type stratification described in §5.
- **New finding:** Within the 339-doc data gap, two systematic categorical exclusions are worth naming separately: **all 14 voice notes** and **all 39 dream outputs** are in the gap. Voice is one of Aaron's primary capture channels; dream outputs are the dreamer's own self-generated reflection. Both are silent to the frame system that orients downstream extraction, which means the dreamer cannot frame-condition on its own output. Same NREM shape as the others: a routing decision the architecture didn't explicitly request.
|
||||
|
||||
## 10. Reproduction
|
||||
|
||||
```bash
cd ~/aaronai
venv/bin/python3 scripts/experiments/frame_distribution_report.py
# stdout: human-readable report
# json:   experiments/frame_distribution_<date>.json
# view:   stage2_frames_v (in pgvector DB)
```
|
||||
|
||||
The view is created with `CREATE OR REPLACE`, so re-running the script is idempotent. Drop it with `DROP VIEW stage2_frames_v;` if needed.
|
||||
@@ -1,30 +0,0 @@
|
||||
{
|
||||
"last_dream_timestamp": 1777480274.444462,
|
||||
"last_dream_mode": "pipeline",
|
||||
"last_dream_file": "Journal/Dreams/2026-04-29-synthesis-1.md",
|
||||
"retrieved_sources": [
|
||||
"ChatGPT: CV Summary Request",
|
||||
"Dossier Narrative.docx",
|
||||
"2026-04-28-early-rem.md",
|
||||
"Advances in Architectural Geometry 2023 -- Kathrin D\u00f6rfler (editor); Jan Knippers (editor); Achim.pdf",
|
||||
"2026-04-29-late-rem.md",
|
||||
"Mod06_GrabCAD_Print_and _Advanced_FDM_2023.pptx",
|
||||
"The Extended Mind _ The Power of Thinking Outside the Brain -- Annie Murphy Paul.pdf",
|
||||
"Utah MDD - Aaron Nelson - Copy.pptx",
|
||||
"ChatGPT: Dean Position Evaluation",
|
||||
"Company of One -- Paul Jarvis.pdf",
|
||||
"References.docx",
|
||||
"Dossier Narrative Kill Me PLS_REV.docx",
|
||||
"ChatGPT: Career change anxiety",
|
||||
"2026-04-27-early-rem-1.md",
|
||||
"The Poetics of Space -- Gaston Bachelard translated from the French by Maria Jolas -- First Edition, 1994.pdf",
|
||||
"ChatGPT: Digital fabrication education",
|
||||
"Dossier Narrative Kill Me PLS.docx",
|
||||
"ChatGPT: Digital Fabrication Cultural Project",
|
||||
"Claude: Weighing Utah versus Oklahoma",
|
||||
"Claude: Importing chat history from ChatGPT",
|
||||
"Claude: I filling out my annual report...",
|
||||
"References.pdf",
|
||||
"Dossier Narrative Kill Me PLS_REV_HOME.docx"
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,791 @@
|
||||
{
|
||||
"metadata": {
|
||||
"generated_at": "2026-04-28T21:10:25",
|
||||
"source_validation_file": "/home/aaron/aaronai/experiments/base_class_validation_results.json",
|
||||
"seed": 43,
|
||||
"stratification": "type-aware within length bucket",
|
||||
"type_targets": {
|
||||
"small": {
|
||||
"course_module": 2,
|
||||
"voice_capture": 2
|
||||
},
|
||||
"medium": {
|
||||
"course_module": 2,
|
||||
"syllabus": 1,
|
||||
"other": 1
|
||||
},
|
||||
"large": {
|
||||
"course_ppt": 1,
|
||||
"syllabus": 1,
|
||||
"faculty_report": 1,
|
||||
"conversational": 1
|
||||
}
|
||||
},
|
||||
"bucket_counts": {
|
||||
"small": 4,
|
||||
"medium": 4,
|
||||
"large": 4
|
||||
},
|
||||
"excluded_count": 10,
|
||||
"warnings": [],
|
||||
"purpose": "n=20 audit expansion per audit-expansion-protocol.md (type-aware amendment)"
|
||||
},
|
||||
"results": [
|
||||
{
|
||||
"source": "02_2D Geometry.docx",
|
||||
"size_bucket": "small",
|
||||
"doc_chars_original": 188,
|
||||
"doc_chars_sent": 188,
|
||||
"truncated": false,
|
||||
"condition_a": {
|
||||
"input_tokens": 207,
|
||||
"output_tokens": 718,
|
||||
"latency_s": 3.29,
|
||||
"metrics": {
|
||||
"n_entities": 10,
|
||||
"n_edges": 21,
|
||||
"predicate_diversity": 3,
|
||||
"type_diversity": 3,
|
||||
"avg_degree": 4.2,
|
||||
"largest_component": 10,
|
||||
"largest_component_pct": 100.0
|
||||
},
|
||||
"stop_reason": "end_turn",
|
||||
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"2D geometry\", \"type\": \"geometric concept\"},\n {\"name\": \"Curves\", \"type\": \"geometric element\"},\n {\"name\": \"lines\", \"type\": \"geometric element\"},\n {\"name\": \"Poly lines\", \"type\": \"geometric element\"},\n {\"name\": \"planar\", \"type\": \"geometric property\"},\n {\"name\": \"non-planar\", \"type\": \"geometric property\"},\n {\"name\": \"open\", \"type\": \"geometric property\"},\n {\"name\": \"closed\", \"type\": \"geometric property\"},\n {\"name\": \"complex forms\", \"type\": \"geometric concept\"},\n {\"name\": \"geometries\", \"type\": \"geometric concept\"}\n ],\n \"edges\": [\n {\"subject\": \"2D geometry\", \"predicate\": \"is comprised of\", \"object\": \"Curves\"},\n {\"subject\": \"2D geometry\", \"predicate\": \"is comprised of\", \"object\": \"lines\"},\n {\"subject\": \"2D geometry\", \"predicate\": \"is comprised of\", \"object\": \"Poly lines\"},\n {\"subject\": \"Curves\", \"predicate\": \"can be\", \"object\": \"planar\"},\n {\"subject\": \"Curves\", \"predicate\": \"can be\", \"object\": \"non-planar\"},\n {\"subject\": \"Curves\", \"predicate\": \"can be\", \"object\": \"open\"},\n {\"subject\": \"Curves\", \"predicate\": \"can be\", \"object\": \"closed\"},\n {\"subject\": \"lines\", \"predicate\": \"can be\", \"object\": \"planar\"},\n {\"subject\": \"lines\", \"predicate\": \"can be\", \"object\": \"non-planar\"},\n {\"subject\": \"lines\", \"predicate\": \"can be\", \"object\": \"open\"},\n {\"subject\": \"lines\", \"predicate\": \"can be\", \"object\": \"closed\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"can be\", \"object\": \"planar\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"can be\", \"object\": \"non-planar\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"can be\", \"object\": \"open\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"can be\", \"object\": \"closed\"},\n {\"subject\": \"Curves\", \"predicate\": \"are 
building blocks to produce\", \"object\": \"complex forms\"},\n {\"subject\": \"lines\", \"predicate\": \"are building blocks to produce\", \"object\": \"complex forms\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"are building blocks to produce\", \"object\": \"complex forms\"},\n {\"subject\": \"Curves\", \"predicate\": \"are building blocks to produce\", \"object\": \"geometries\"},\n {\"subject\": \"lines\", \"predicate\": \"are building blocks to produce\", \"object\": \"geometries\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"are building blocks to produce\", \"object\": \"geometries\"}\n ]\n}\n```",
|
||||
"error": null
|
||||
},
|
||||
"condition_b": {
|
||||
"skipped": "local_model_failed",
|
||||
"local_error": "HTTPConnectionPool(host='localhost', port=11434): Read timed out. (read timeout=180)",
|
||||
"local_latency_s": 180.1
|
||||
},
|
||||
"_type": "course_module"
|
||||
},
|
||||
{
|
||||
"source": "03_2D Transformation, Deformation, and Editing-AARON.docx",
|
||||
"size_bucket": "small",
|
||||
"doc_chars_original": 541,
|
||||
"doc_chars_sent": 541,
|
||||
"truncated": false,
|
||||
"condition_a": {
|
||||
"input_tokens": 271,
|
||||
"output_tokens": 587,
|
||||
"latency_s": 2.9,
|
||||
"metrics": {
|
||||
"n_entities": 14,
|
||||
"n_edges": 13,
|
||||
"predicate_diversity": 9,
|
||||
"type_diversity": 4,
|
||||
"avg_degree": 1.86,
|
||||
"largest_component": 11,
|
||||
"largest_component_pct": 78.6
|
||||
},
|
||||
"stop_reason": "end_turn",
|
||||
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Manipulation of geometry\", \"type\": \"Concept\"},\n {\"name\": \"Transformations\", \"type\": \"Category\"},\n {\"name\": \"Deformations\", \"type\": \"Category\"},\n {\"name\": \"proportions\", \"type\": \"Property\"},\n {\"name\": \"relative geometries\", \"type\": \"Property\"},\n {\"name\": \"curves\", \"type\": \"Geometric Element\"},\n {\"name\": \"objects\", \"type\": \"Geometric Element\"},\n {\"name\": \"space\", \"type\": \"Concept\"},\n {\"name\": \"angles\", \"type\": \"Property\"},\n {\"name\": \"curvature\", \"type\": \"Property\"},\n {\"name\": \"Editing Commands\", \"type\": \"Concept\"},\n {\"name\": \"geometry\", \"type\": \"Concept\"},\n {\"name\": \"sections\", \"type\": \"Geometric Element\"},\n {\"name\": \"form\", \"type\": \"Geometric Element\"}\n ],\n \"edges\": [\n {\"subject\": \"Manipulation of geometry\", \"predicate\": \"can be broken down into\", \"object\": \"Transformations\"},\n {\"subject\": \"Manipulation of geometry\", \"predicate\": \"can be broken down into\", \"object\": \"Deformations\"},\n {\"subject\": \"Transformations\", \"predicate\": \"do not change\", \"object\": \"proportions\"},\n {\"subject\": \"Transformations\", \"predicate\": \"do not change\", \"object\": \"relative geometries\"},\n {\"subject\": \"Transformations\", \"predicate\": \"change\", \"object\": \"space\"},\n {\"subject\": \"Deformations\", \"predicate\": \"are changes made to\", \"object\": \"geometry\"},\n {\"subject\": \"Deformations\", \"predicate\": \"affect\", \"object\": \"proportions\"},\n {\"subject\": \"Deformations\", \"predicate\": \"affect\", \"object\": \"angles\"},\n {\"subject\": \"Deformations\", \"predicate\": \"affect\", \"object\": \"curvature\"},\n {\"subject\": \"Editing Commands\", \"predicate\": \"can edit\", \"object\": \"geometry\"},\n {\"subject\": \"Editing Commands\", \"predicate\": \"involves adding\", \"object\": \"sections\"},\n {\"subject\": \"Editing 
Commands\", \"predicate\": \"involves combining\", \"object\": \"sections\"},\n {\"subject\": \"Editing Commands\", \"predicate\": \"involves removing\", \"object\": \"sections\"}\n ]\n}\n```",
|
||||
"error": null
|
||||
},
|
||||
"condition_b": {
|
||||
"local_latency_s": 112.09,
|
||||
"local_metadata": {
|
||||
"language": "en",
|
||||
"char_length": 230,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": false,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": false
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": false,
|
||||
"has_institutional_language": false,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": false,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "technical",
|
||||
"one_sentence_summary": "Document discusses manipulation of geometry through transformations and deformations, with methods for editing geometry"
|
||||
},
|
||||
"local_raw": "{\n \"language\": \"en\",\n \"char_length\": 230,\n \"primary_format\": \"prose\",\n \"structural_signals\": {\n \"has_headings\": false,\n \"has_bullet_lists\": false,\n \"has_numbered_lists\": false,\n \"has_tables\": false,\n \"has_code_blocks\": false,\n \"has_dates\": false\n },\n \"content_signals\": {\n \"has_named_people\": false,\n \"has_institutional_language\": false,\n \"has_technical_terminology\": true,\n \"has_first_person\": false,\n \"has_quotations\": false\n },\n \"domain_class\": \"technical\",\n \"one_sentence_summary\": \"Document discusses manipulation of geometry through transformations and deformations, with methods for editing geometry\"\n}",
|
||||
"api_input_tokens": 533,
|
||||
"api_output_tokens": 691,
|
||||
"api_latency_s": 3.39,
|
||||
"metrics": {
|
||||
"n_entities": 17,
|
||||
"n_edges": 16,
|
||||
"predicate_diversity": 8,
|
||||
"type_diversity": 5,
|
||||
"avg_degree": 1.88,
|
||||
"largest_component": 13,
|
||||
"largest_component_pct": 76.5
|
||||
},
|
||||
"stop_reason": "end_turn",
|
||||
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Manipulation of geometry\", \"type\": \"concept\"},\n {\"name\": \"Transformations\", \"type\": \"category\"},\n {\"name\": \"Deformations\", \"type\": \"category\"},\n {\"name\": \"proportions\", \"type\": \"attribute\"},\n {\"name\": \"relative geometries\", \"type\": \"attribute\"},\n {\"name\": \"curves\", \"type\": \"geometric_element\"},\n {\"name\": \"objects\", \"type\": \"geometric_element\"},\n {\"name\": \"space\", \"type\": \"concept\"},\n {\"name\": \"angles\", \"type\": \"attribute\"},\n {\"name\": \"curvature\", \"type\": \"attribute\"},\n {\"name\": \"Editing Commands\", \"type\": \"concept\"},\n {\"name\": \"geometry\", \"type\": \"concept\"},\n {\"name\": \"adding\", \"type\": \"operation\"},\n {\"name\": \"combining\", \"type\": \"operation\"},\n {\"name\": \"removing\", \"type\": \"operation\"},\n {\"name\": \"sections\", \"type\": \"geometric_element\"},\n {\"name\": \"form\", \"type\": \"geometric_element\"}\n ],\n \"edges\": [\n {\"subject\": \"Manipulation of geometry\", \"predicate\": \"breaks down into\", \"object\": \"Transformations\"},\n {\"subject\": \"Manipulation of geometry\", \"predicate\": \"breaks down into\", \"object\": \"Deformations\"},\n {\"subject\": \"Transformations\", \"predicate\": \"do not change\", \"object\": \"proportions\"},\n {\"subject\": \"Transformations\", \"predicate\": \"do not change\", \"object\": \"relative geometries\"},\n {\"subject\": \"Transformations\", \"predicate\": \"change\", \"object\": \"relationship to space\"},\n {\"subject\": \"Deformations\", \"predicate\": \"are changes to\", \"object\": \"geometry\"},\n {\"subject\": \"Deformations\", \"predicate\": \"affect\", \"object\": \"proportions\"},\n {\"subject\": \"Deformations\", \"predicate\": \"affect\", \"object\": \"angles\"},\n {\"subject\": \"Deformations\", \"predicate\": \"affect\", \"object\": \"curvature\"},\n {\"subject\": \"Editing Commands\", \"predicate\": \"can edit\", 
\"object\": \"geometry\"},\n {\"subject\": \"adding\", \"predicate\": \"is method for\", \"object\": \"Editing Commands\"},\n {\"subject\": \"combining\", \"predicate\": \"is method for\", \"object\": \"Editing Commands\"},\n {\"subject\": \"removing\", \"predicate\": \"is method for\", \"object\": \"Editing Commands\"},\n {\"subject\": \"adding\", \"predicate\": \"operates on\", \"object\": \"sections\"},\n {\"subject\": \"combining\", \"predicate\": \"operates on\", \"object\": \"sections\"},\n {\"subject\": \"removing\", \"predicate\": \"operates on\", \"object\": \"sections\"}\n ]\n}\n```",
|
||||
"error": null
|
||||
},
|
||||
"_type": "course_module"
|
||||
},
|
||||
{
|
||||
"source": "2026-04-26-23-04-voice.md",
|
||||
"size_bucket": "small",
|
||||
"doc_chars_original": 931,
|
||||
"doc_chars_sent": 931,
|
||||
"truncated": false,
|
||||
"condition_a": {
|
||||
"input_tokens": 397,
|
||||
"output_tokens": 672,
|
||||
"latency_s": 3.43,
|
||||
"metrics": {
|
||||
"n_entities": 13,
|
||||
"n_edges": 17,
|
||||
"predicate_diversity": 16,
|
||||
"type_diversity": 7,
|
||||
"avg_degree": 2.62,
|
||||
"largest_component": 7,
|
||||
"largest_component_pct": 53.8
|
||||
},
|
||||
"stop_reason": "end_turn",
|
||||
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Capture \u2014 2026-04-26-23-04\", \"type\": \"voice recording\"},\n {\"name\": \"extended mine experiment\", \"type\": \"project\"},\n {\"name\": \"extended mine project\", \"type\": \"project\"},\n {\"name\": \"Claude AI\", \"type\": \"AI assistant\"},\n {\"name\": \"Bird\", \"type\": \"dog\"},\n {\"name\": \"Cat\", \"type\": \"person\"},\n {\"name\": \"apple orchard\", \"type\": \"location\"},\n {\"name\": \"house\", \"type\": \"location\"},\n {\"name\": \"rose\", \"type\": \"location\"},\n {\"name\": \"stick\", \"type\": \"object\"},\n {\"name\": \"teal crocodile toy\", \"type\": \"object\"},\n {\"name\": \"dishes\", \"type\": \"object\"},\n {\"name\": \"sink\", \"type\": \"location\"}\n ],\n \"edges\": [\n {\"subject\": \"Capture \u2014 2026-04-26-23-04\", \"predicate\": \"is a\", \"object\": \"voice recording\"},\n {\"subject\": \"Capture \u2014 2026-04-26-23-04\", \"predicate\": \"has modality\", \"object\": \"audio\"},\n {\"subject\": \"Capture \u2014 2026-04-26-23-04\", \"predicate\": \"has status\", \"object\": \"unprocessed\"},\n {\"subject\": \"speaker\", \"predicate\": \"worked on\", \"object\": \"extended mine project\"},\n {\"subject\": \"speaker\", \"predicate\": \"worked with\", \"object\": \"Claude AI\"},\n {\"subject\": \"speaker\", \"predicate\": \"did\", \"object\": \"dishes\"},\n {\"subject\": \"dishes\", \"predicate\": \"were in\", \"object\": \"sink\"},\n {\"subject\": \"speaker\", \"predicate\": \"took on walk\", \"object\": \"Bird\"},\n {\"subject\": \"Cat\", \"predicate\": \"accompanied on walk\", \"object\": \"Bird\"},\n {\"subject\": \"Bird\", \"predicate\": \"walked through\", \"object\": \"apple orchard\"},\n {\"subject\": \"apple orchard\", \"predicate\": \"is next to\", \"object\": \"house\"},\n {\"subject\": \"Bird\", \"predicate\": \"walked through\", \"object\": \"rose\"},\n {\"subject\": \"Bird\", \"predicate\": \"pooped in\", \"object\": \"orchard\"},\n {\"subject\": 
\"Bird\", \"predicate\": \"chased\", \"object\": \"stick\"},\n {\"subject\": \"Bird\", \"predicate\": \"has\", \"object\": \"teal crocodile toy\"},\n {\"subject\": \"Bird\", \"predicate\": \"is following\", \"object\": \"speaker\"},\n {\"subject\": \"Bird\", \"predicate\": \"is squeaking at\", \"object\": \"speaker\"}\n ]\n}\n```",
|
||||
"error": null
|
||||
},
|
||||
"condition_b": {
|
||||
"local_latency_s": 20.78,
|
||||
"local_metadata": {
|
||||
"language": "en",
|
||||
"char_length": 382,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": false,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": true
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": true,
|
||||
"has_institutional_language": false,
|
||||
"has_technical_terminology": false,
|
||||
"has_first_person": true,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "personal",
|
||||
"one_sentence_summary": "A personal account of an individual's activities on a specific day, including household chores and pet care"
|
||||
},
|
||||
"local_raw": "{\n \"language\": \"en\",\n \"char_length\": 382,\n \"primary_format\": \"prose\",\n \"structural_signals\": {\n \"has_headings\": false,\n \"has_bullet_lists\": false,\n \"has_numbered_lists\": false,\n \"has_tables\": false,\n \"has_code_blocks\": false,\n \"has_dates\": true\n },\n \"content_signals\": {\n \"has_named_people\": true,\n \"has_institutional_language\": false,\n \"has_technical_terminology\": false,\n \"has_first_person\": true,\n \"has_quotations\": false\n },\n \"domain_class\": \"personal\",\n \"one_sentence_summary\": \"A personal account of an individual's activities on a specific day, including household chores and pet care\"\n}",
|
||||
"api_input_tokens": 661,
|
||||
"api_output_tokens": 740,
|
||||
"api_latency_s": 3.6,
|
||||
"metrics": {
|
||||
"n_entities": 21,
|
||||
"n_edges": 17,
|
||||
"predicate_diversity": 17,
|
||||
"type_diversity": 12,
|
||||
"avg_degree": 1.62,
|
||||
"largest_component": 12,
|
||||
"largest_component_pct": 57.1
|
||||
},
|
||||
"stop_reason": "end_turn",
|
||||
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"2026-04-26\", \"type\": \"date\"},\n {\"name\": \"voice recording\", \"type\": \"artifact\"},\n {\"name\": \"field recording\", \"type\": \"method\"},\n {\"name\": \"extended mine experiment\", \"type\": \"project\"},\n {\"name\": \"extended mine project\", \"type\": \"project\"},\n {\"name\": \"Claude AI\", \"type\": \"entity\"},\n {\"name\": \"dishes\", \"type\": \"object\"},\n {\"name\": \"sink\", \"type\": \"location\"},\n {\"name\": \"Bird\", \"type\": \"animal\"},\n {\"name\": \"Dog\", \"type\": \"animal\"},\n {\"name\": \"walk\", \"type\": \"activity\"},\n {\"name\": \"apple orchard\", \"type\": \"location\"},\n {\"name\": \"house\", \"type\": \"location\"},\n {\"name\": \"Cat\", \"type\": \"animal\"},\n {\"name\": \"rose\", \"type\": \"location\"},\n {\"name\": \"poop\", \"type\": \"object\"},\n {\"name\": \"stick\", \"type\": \"object\"},\n {\"name\": \"teal crocodile toy\", \"type\": \"object\"},\n {\"name\": \"Sunday\", \"type\": \"day\"},\n {\"name\": \"10:30\", \"type\": \"time\"},\n {\"name\": \"narrator\", \"type\": \"person\"}\n ],\n \"edges\": [\n {\"subject\": \"narrator\", \"predicate\": \"woke up at\", \"object\": \"10:30\"},\n {\"subject\": \"narrator\", \"predicate\": \"worked on\", \"object\": \"extended mine project\"},\n {\"subject\": \"narrator\", \"predicate\": \"worked with\", \"object\": \"Claude AI\"},\n {\"subject\": \"narrator\", \"predicate\": \"washed\", \"object\": \"dishes\"},\n {\"subject\": \"dishes\", \"predicate\": \"were in\", \"object\": \"sink\"},\n {\"subject\": \"narrator\", \"predicate\": \"took on walk\", \"object\": \"Bird\"},\n {\"subject\": \"narrator\", \"predicate\": \"walked with\", \"object\": \"Cat\"},\n {\"subject\": \"walk\", \"predicate\": \"occurred through\", \"object\": \"apple orchard\"},\n {\"subject\": \"apple orchard\", \"predicate\": \"is next to\", \"object\": \"house\"},\n {\"subject\": \"Bird\", \"predicate\": \"is a\", \"object\": 
\"Dog\"},\n {\"subject\": \"Bird\", \"predicate\": \"defecated in\", \"object\": \"orchard\"},\n {\"subject\": \"Bird\", \"predicate\": \"chased\", \"object\": \"stick\"},\n {\"subject\": \"Bird\", \"predicate\": \"has\", \"object\": \"teal crocodile toy\"},\n {\"subject\": \"Bird\", \"predicate\": \"is following\", \"object\": \"narrator\"},\n {\"subject\": \"Bird\", \"predicate\": \"is squeaking at\", \"object\": \"narrator\"},\n {\"subject\": \"narrator\", \"predicate\": \"walked through\", \"object\": \"rose\"},\n {\"subject\": \"day\", \"predicate\": \"is\", \"object\": \"Sunday\"}\n ]\n}\n```",
|
||||
"error": null
|
||||
},
|
||||
"_type": "voice_capture"
|
||||
},
|
||||
{
|
||||
"source": "2026-04-26-22-44-voice.md",
|
||||
"size_bucket": "small",
|
||||
"doc_chars_original": 165,
|
||||
"doc_chars_sent": 165,
|
||||
"truncated": false,
|
||||
"condition_a": {
|
||||
"input_tokens": 207,
|
||||
"output_tokens": 344,
|
||||
"latency_s": 1.98,
|
||||
"metrics": {
|
||||
"n_entities": 6,
|
||||
"n_edges": 6,
|
||||
"predicate_diversity": 6,
|
||||
"type_diversity": 5,
|
||||
"avg_degree": 2.0,
|
||||
"largest_component": 6,
|
||||
"largest_component_pct": 100.0
|
||||
},
|
||||
"stop_reason": "end_turn",
|
||||
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Capture \u2014 2026-04-26-22-44\", \"type\": \"recording\"},\n {\"name\": \"voice\", \"type\": \"type\"},\n {\"name\": \"audio\", \"type\": \"modality\"},\n {\"name\": \"unprocessed\", \"type\": \"status\"},\n {\"name\": \"test field recording\", \"type\": \"recording\"},\n {\"name\": \"capture layer\", \"type\": \"system_layer\"}\n ],\n \"edges\": [\n {\"subject\": \"Capture \u2014 2026-04-26-22-44\", \"predicate\": \"has type\", \"object\": \"voice\"},\n {\"subject\": \"Capture \u2014 2026-04-26-22-44\", \"predicate\": \"has modality\", \"object\": \"audio\"},\n {\"subject\": \"Capture \u2014 2026-04-26-22-44\", \"predicate\": \"has status\", \"object\": \"unprocessed\"},\n {\"subject\": \"Capture \u2014 2026-04-26-22-44\", \"predicate\": \"is a\", \"object\": \"test field recording\"},\n {\"subject\": \"test field recording\", \"predicate\": \"exists on\", \"object\": \"capture layer\"},\n {\"subject\": \"Capture \u2014 2026-04-26-22-44\", \"predicate\": \"is the first recording of\", \"object\": \"test field recording\"}\n ]\n}\n```",
|
||||
"error": null
|
||||
},
|
||||
"condition_b": {
|
||||
"local_latency_s": 14.45,
|
||||
"local_metadata": {
|
||||
"language": "en",
|
||||
"char_length": 137,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": false,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": true
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": false,
|
||||
"has_institutional_language": false,
|
||||
"has_technical_terminology": false,
|
||||
"has_first_person": false,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "creative",
|
||||
"one_sentence_summary": "Unprocessed audio test field recording"
|
||||
},
|
||||
"local_raw": "{\n \"language\": \"en\",\n \"char_length\": 137,\n \"primary_format\": \"prose\",\n \"structural_signals\": {\n \"has_headings\": false,\n \"has_bullet_lists\": false,\n \"has_numbered_lists\": false,\n \"has_tables\": false,\n \"has_code_blocks\": false,\n \"has_dates\": true\n },\n \"content_signals\": {\n \"has_named_people\": false,\n \"has_institutional_language\": false,\n \"has_technical_terminology\": false,\n \"has_first_person\": false,\n \"has_quotations\": false\n },\n \"domain_class\": \"creative\",\n \"one_sentence_summary\": \"Unprocessed audio test field recording\"\n}",
|
||||
"api_input_tokens": 459,
|
||||
"api_output_tokens": 369,
|
||||
"api_latency_s": 2.2,
|
||||
"metrics": {
|
||||
"n_entities": 7,
|
||||
"n_edges": 6,
|
||||
"predicate_diversity": 6,
|
||||
"type_diversity": 7,
|
||||
"avg_degree": 1.71,
|
||||
"largest_component": 7,
|
||||
"largest_component_pct": 100.0
|
||||
},
|
||||
"stop_reason": "end_turn",
|
||||
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Capture \u2014 2026-04-26-22-44\", \"type\": \"recording\"},\n {\"name\": \"voice\", \"type\": \"type\"},\n {\"name\": \"audio\", \"type\": \"modality\"},\n {\"name\": \"unprocessed\", \"type\": \"status\"},\n {\"name\": \"test field recording\", \"type\": \"recording_classification\"},\n {\"name\": \"capture layer\", \"type\": \"layer\"},\n {\"name\": \"first recording\", \"type\": \"recording_ordinal\"}\n ],\n \"edges\": [\n {\"subject\": \"Capture \u2014 2026-04-26-22-44\", \"predicate\": \"has type\", \"object\": \"voice\"},\n {\"subject\": \"Capture \u2014 2026-04-26-22-44\", \"predicate\": \"has modality\", \"object\": \"audio\"},\n {\"subject\": \"Capture \u2014 2026-04-26-22-44\", \"predicate\": \"has status\", \"object\": \"unprocessed\"},\n {\"subject\": \"Capture \u2014 2026-04-26-22-44\", \"predicate\": \"is classified as\", \"object\": \"test field recording\"},\n {\"subject\": \"Capture \u2014 2026-04-26-22-44\", \"predicate\": \"exists on\", \"object\": \"capture layer\"},\n {\"subject\": \"Capture \u2014 2026-04-26-22-44\", \"predicate\": \"is\", \"object\": \"first recording\"}\n ]\n}\n```",
|
||||
"error": null
|
||||
},
|
||||
"_type": "voice_capture"
|
||||
},
|
||||
{
|
||||
"source": "07_Cube Assignment_2018f.docx",
|
||||
"size_bucket": "medium",
|
||||
"doc_chars_original": 1316,
|
||||
"doc_chars_sent": 1316,
|
||||
"truncated": false,
|
||||
"condition_a": {
|
||||
"input_tokens": 463,
|
||||
"output_tokens": 902,
|
||||
"latency_s": 4.36,
|
||||
"metrics": {
|
||||
"n_entities": 23,
|
||||
"n_edges": 19,
|
||||
"predicate_diversity": 12,
|
||||
"type_diversity": 13,
|
||||
"avg_degree": 1.65,
|
||||
"largest_component": 13,
|
||||
"largest_component_pct": 56.5
|
||||
},
|
||||
"stop_reason": "end_turn",
|
||||
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Cube\", \"type\": \"Object\"},\n {\"name\": \"Blue Cube\", \"type\": \"Object\"},\n {\"name\": \"Blue Layer\", \"type\": \"Layer\"},\n {\"name\": \"Measurements\", \"type\": \"Specification\"},\n {\"name\": \"1mm\", \"type\": \"Unit\"},\n {\"name\": \"0.04 inches\", \"type\": \"Unit\"},\n {\"name\": \"Initials\", \"type\": \"Element\"},\n {\"name\": \"Four view schematic\", \"type\": \"Drawing\"},\n {\"name\": \"2D drawing\", \"type\": \"Drawing\"},\n {\"name\": \"Dimensions\", \"type\": \"Specification\"},\n {\"name\": \"STL file\", \"type\": \"File Format\"},\n {\"name\": \"Center\", \"type\": \"Location\"},\n {\"name\": \"Standard layers\", \"type\": \"Setting\"},\n {\"name\": \"Infill\", \"type\": \"Setting\"},\n {\"name\": \"Raft\", \"type\": \"Setting\"},\n {\"name\": \"Supports\", \"type\": \"Setting\"},\n {\"name\": \"Rhino file\", \"type\": \"File Format\"},\n {\"name\": \"Blackboard\", \"type\": \"Platform\"},\n {\"name\": \"Colored layer\", \"type\": \"Layer\"},\n {\"name\": \"Labeled schematic\", \"type\": \"Drawing\"},\n {\"name\": \"Grade\", \"type\": \"Assessment\"},\n {\"name\": \"Letter grade\", \"type\": \"Assessment Unit\"},\n {\"name\": \"Re-submission\", \"type\": \"Process\"}\n ],\n \"edges\": [\n {\"subject\": \"Blue Cube\", \"predicate\": \"should be placed on\", \"object\": \"Blue Layer\"},\n {\"subject\": \"Cube\", \"predicate\": \"measurements should be accurate to within\", \"object\": \"1mm\"},\n {\"subject\": \"Cube\", \"predicate\": \"measurements should be accurate to within\", \"object\": \"0.04 inches\"},\n {\"subject\": \"Cube\", \"predicate\": \"should include\", \"object\": \"Initials\"},\n {\"subject\": \"Initials\", \"predicate\": \"should be\", \"object\": \"Legible\"},\n {\"subject\": \"Cube\", \"predicate\": \"requires\", \"object\": \"Four view schematic\"},\n {\"subject\": \"Four view schematic\", \"predicate\": \"is a\", \"object\": \"2D drawing\"},\n 
{\"subject\": \"2D drawing\", \"predicate\": \"should include\", \"object\": \"Dimensions\"},\n {\"subject\": \"Cube\", \"predicate\": \"should be exported as\", \"object\": \"STL file\"},\n {\"subject\": \"STL file\", \"predicate\": \"should be submitted to\", \"object\": \"Center\"},\n {\"subject\": \"Cube\", \"predicate\": \"should use settings\", \"object\": \"Standard layers\"},\n {\"subject\": \"Cube\", \"predicate\": \"should use settings\", \"object\": \"Infill\"},\n {\"subject\": \"Cube\", \"predicate\": \"should not use\", \"object\": \"Raft\"},\n {\"subject\": \"Cube\", \"predicate\": \"should not use\", \"object\": \"Supports\"},\n {\"subject\": \"Rhino file\", \"predicate\": \"should be submitted to\", \"object\": \"Blackboard\"},\n {\"subject\": \"Rhino file\", \"predicate\": \"should include\", \"object\": \"Colored layer\"},\n {\"subject\": \"Rhino file\", \"predicate\": \"should include\", \"object\": \"Labeled schematic\"},\n {\"subject\": \"Late submission\", \"predicate\": \"results in\", \"object\": \"Grade reduction by letter grade\"},\n {\"subject\": \"Re-submission\", \"predicate\": \"is not allowed for\", \"object\": \"Better grade\"}\n ]\n}\n```",
"error": null
},
"condition_b": {
"local_latency_s": 22.85,
"local_metadata": {
"language": "en",
"char_length": 230,
"primary_format": "prose",
"structural_signals": {
"has_headings": false,
"has_bullet_lists": false,
"has_numbered_lists": false,
"has_tables": false,
"has_code_blocks": false,
"has_dates": false
},
"content_signals": {
"has_named_people": false,
"has_institutional_language": true,
"has_technical_terminology": true,
"has_first_person": false,
"has_quotations": false
},
"domain_class": "technical",
"one_sentence_summary": "Instructions for creating and submitting cubes in a specific format"
},
"local_raw": "{\n \"language\": \"en\",\n \"char_length\": 230,\n \"primary_format\": \"prose\",\n \"structural_signals\": {\n \"has_headings\": false,\n \"has_bullet_lists\": false,\n \"has_numbered_lists\": false,\n \"has_tables\": false,\n \"has_code_blocks\": false,\n \"has_dates\": false\n },\n \"content_signals\": {\n \"has_named_people\": false,\n \"has_institutional_language\": true,\n \"has_technical_terminology\": true,\n \"has_first_person\": false,\n \"has_quotations\": false\n },\n \"domain_class\": \"technical\",\n \"one_sentence_summary\": \"Instructions for creating and submitting cubes in a specific format\"\n}",
"api_input_tokens": 719,
"api_output_tokens": 1229,
"api_latency_s": 5.91,
"metrics": {
"n_entities": 32,
"n_edges": 27,
"predicate_diversity": 19,
"type_diversity": 16,
"avg_degree": 1.69,
"largest_component": 12,
"largest_component_pct": 37.5
},
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"cubes\", \"type\": \"object\"},\n {\"name\": \"measurements\", \"type\": \"attribute\"},\n {\"name\": \"1mm\", \"type\": \"measurement_value\"},\n {\"name\": \"0.04 inches\", \"type\": \"measurement_value\"},\n {\"name\": \"scale\", \"type\": \"attribute\"},\n {\"name\": \"layer\", \"type\": \"structural_element\"},\n {\"name\": \"color\", \"type\": \"attribute\"},\n {\"name\": \"Blue Cube\", \"type\": \"object\"},\n {\"name\": \"Blue Layer\", \"type\": \"structural_element\"},\n {\"name\": \"initials\", \"type\": \"content\"},\n {\"name\": \"legibility\", \"type\": \"attribute\"},\n {\"name\": \"four view schematic\", \"type\": \"document_type\"},\n {\"name\": \"2D drawing\", \"type\": \"document_type\"},\n {\"name\": \"dimensions\", \"type\": \"attribute\"},\n {\"name\": \"details\", \"type\": \"information\"},\n {\"name\": \"additional views\", \"type\": \"document_type\"},\n {\"name\": \"STL format\", \"type\": \"file_format\"},\n {\"name\": \"center\", \"type\": \"location\"},\n {\"name\": \"orientation\", \"type\": \"attribute\"},\n {\"name\": \"standard layers\", \"type\": \"setting\"},\n {\"name\": \"infill\", \"type\": \"setting\"},\n {\"name\": \"raft\", \"type\": \"setting\"},\n {\"name\": \"supports\", \"type\": \"setting\"},\n {\"name\": \"Rhino file\", \"type\": \"file_type\"},\n {\"name\": \"blackboard\", \"type\": \"platform\"},\n {\"name\": \"printed cubes\", \"type\": \"object\"},\n {\"name\": \"colored layer\", \"type\": \"structural_element\"},\n {\"name\": \"labeled schematic\", \"type\": \"document_type\"},\n {\"name\": \"submission deadline\", \"type\": \"constraint\"},\n {\"name\": \"grade\", \"type\": \"evaluation_metric\"},\n {\"name\": \"letter grade\", \"type\": \"evaluation_unit\"},\n {\"name\": \"re-submission\", \"type\": \"action\"}\n ],\n \"edges\": [\n {\"subject\": \"cubes\", \"predicate\": \"should be replicated with\", \"object\": \"measurements\"},\n {\"subject\": 
\"measurements\", \"predicate\": \"should be accurate to within\", \"object\": \"1mm\"},\n {\"subject\": \"measurements\", \"predicate\": \"should be accurate to within\", \"object\": \"0.04 inches\"},\n {\"subject\": \"cubes\", \"predicate\": \"should be to\", \"object\": \"scale\"},\n {\"subject\": \"Blue Cube\", \"predicate\": \"should be placed on\", \"object\": \"Blue Layer\"},\n {\"subject\": \"cubes\", \"predicate\": \"should include\", \"object\": \"initials\"},\n {\"subject\": \"initials\", \"predicate\": \"should be\", \"object\": \"legibility\"},\n {\"subject\": \"initials\", \"predicate\": \"should be at\", \"object\": \"appropriate scale\"},\n {\"subject\": \"cubes\", \"predicate\": \"requires\", \"object\": \"four view schematic\"},\n {\"subject\": \"four view schematic\", \"predicate\": \"is a\", \"object\": \"2D drawing\"},\n {\"subject\": \"2D drawing\", \"predicate\": \"should include\", \"object\": \"dimensions\"},\n {\"subject\": \"dimensions\", \"predicate\": \"should indicate\", \"object\": \"details\"},\n {\"subject\": \"drawing\", \"predicate\": \"should be on\", \"object\": \"colored layer\"},\n {\"subject\": \"cubes\", \"predicate\": \"may require\", \"object\": \"additional views\"},\n {\"subject\": \"finished cubes\", \"predicate\": \"should be exported as\", \"object\": \"STL format\"},\n {\"subject\": \"exported cubes\", \"predicate\": \"should be submitted to\", \"object\": \"center\"},\n {\"subject\": \"cubes\", \"predicate\": \"should be oriented in same manner as\", \"object\": \"originals\"},\n {\"subject\": \"settings\", \"predicate\": \"should include\", \"object\": \"standard layers\"},\n {\"subject\": \"settings\", \"predicate\": \"should include\", \"object\": \"infill\"},\n {\"subject\": \"settings\", \"predicate\": \"should not include\", \"object\": \"raft\"},\n {\"subject\": \"settings\", \"predicate\": \"should not include\", \"object\": \"supports\"},\n {\"subject\": \"Rhino file\", \"predicate\": \"should be submitted 
to\", \"object\": \"blackboard\"},\n {\"subject\": \"printed cubes\", \"predicate\": \"should be submitted with\", \"object\": \"Rhino file\"},\n {\"subject\": \"Rhino file\", \"predicate\": \"should include\", \"object\": \"colored layer\"},\n {\"subject\": \"Rhino file\", \"predicate\": \"should include\", \"object\": \"labeled schematic\"},\n {\"subject\": \"late submission\", \"predicate\": \"results in grade reduction of\", \"object\": \"letter grade\"},\n {\"subject\":",
"error": null
},
"_type": "course_module"
},
{
"source": "06_Gumball.docx",
"size_bucket": "medium",
"doc_chars_original": 1980,
"doc_chars_sent": 1980,
"truncated": false,
"condition_a": {
"input_tokens": 652,
"output_tokens": 1467,
"latency_s": 6.64,
"metrics": {
"n_entities": 34,
"n_edges": 30,
"predicate_diversity": 17,
"type_diversity": 16,
"avg_degree": 1.76,
"largest_component": 7,
"largest_component_pct": 20.6
},
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Gumball\", \"type\": \"Tool\"},\n {\"name\": \"Widget\", \"type\": \"Component\"},\n {\"name\": \"Selected object\", \"type\": \"Entity\"},\n {\"name\": \"Move transformation\", \"type\": \"Transformation\"},\n {\"name\": \"Scale transformation\", \"type\": \"Transformation\"},\n {\"name\": \"Rotate transformation\", \"type\": \"Transformation\"},\n {\"name\": \"Gumball origin\", \"type\": \"Reference point\"},\n {\"name\": \"Gumball menu\", \"type\": \"Interface\"},\n {\"name\": \"White menu ball\", \"type\": \"UI element\"},\n {\"name\": \"Relocate Gumball\", \"type\": \"Menu option\"},\n {\"name\": \"Gumball center\", \"type\": \"Reference point\"},\n {\"name\": \"Reset Gumball\", \"type\": \"Menu option\"},\n {\"name\": \"Area centroid\", \"type\": \"Reference point\"},\n {\"name\": \"Align to CPlane\", \"type\": \"Menu option\"},\n {\"name\": \"Align to Object\", \"type\": \"Menu option\"},\n {\"name\": \"Align to World\", \"type\": \"Menu option\"},\n {\"name\": \"CPlane\", \"type\": \"Coordinate system\"},\n {\"name\": \"X axis\", \"type\": \"Axis\"},\n {\"name\": \"Y axis\", \"type\": \"Axis\"},\n {\"name\": \"Z axis\", \"type\": \"Axis\"},\n {\"name\": \"U curve\", \"type\": \"Object property\"},\n {\"name\": \"V curve\", \"type\": \"Object property\"},\n {\"name\": \"World grid system\", \"type\": \"Coordinate system\"},\n {\"name\": \"Snappy dragging\", \"type\": \"Dragging mode\"},\n {\"name\": \"Smooth dragging\", \"type\": \"Dragging mode\"},\n {\"name\": \"Osnaps\", \"type\": \"Feature\"},\n {\"name\": \"Drag strength\", \"type\": \"Parameter\"},\n {\"name\": \"Gumball elements\", \"type\": \"Component\"},\n {\"name\": \"Red color\", \"type\": \"Color\"},\n {\"name\": \"Green color\", \"type\": \"Color\"},\n {\"name\": \"Blue color\", \"type\": \"Color\"},\n {\"name\": \"Dotted lines\", \"type\": \"Visual element\"},\n {\"name\": \"Arrow\", \"type\": \"Visual element\"},\n {\"name\": 
\"Box\", \"type\": \"Visual element\"}\n ],\n \"edges\": [\n {\"subject\": \"Gumball\", \"predicate\": \"is a\", \"object\": \"Widget\"},\n {\"subject\": \"Gumball\", \"predicate\": \"is used to facilitate\", \"object\": \"Direct editing\"},\n {\"subject\": \"Gumball\", \"predicate\": \"provides\", \"object\": \"Move transformation\"},\n {\"subject\": \"Gumball\", \"predicate\": \"provides\", \"object\": \"Scale transformation\"},\n {\"subject\": \"Gumball\", \"predicate\": \"provides\", \"object\": \"Rotate transformation\"},\n {\"subject\": \"Gumball\", \"predicate\": \"operates around\", \"object\": \"Gumball origin\"},\n {\"subject\": \"Gumball menu\", \"predicate\": \"is accessed by\", \"object\": \"White menu ball\"},\n {\"subject\": \"Relocate Gumball\", \"predicate\": \"allows redefinition of\", \"object\": \"Gumball center\"},\n {\"subject\": \"Gumball center\", \"predicate\": \"is the origin point for\", \"object\": \"Scaling\"},\n {\"subject\": \"Gumball center\", \"predicate\": \"is the origin point for\", \"object\": \"Translating\"},\n {\"subject\": \"Gumball center\", \"predicate\": \"is the origin point for\", \"object\": \"Rotating\"},\n {\"subject\": \"Reset Gumball\", \"predicate\": \"places gumball at\", \"object\": \"Area centroid\"},\n {\"subject\": \"Align to CPlane\", \"predicate\": \"makes arrows point in direction of\", \"object\": \"X axis\"},\n {\"subject\": \"Align to CPlane\", \"predicate\": \"makes arrows point in direction of\", \"object\": \"Y axis\"},\n {\"subject\": \"Align to CPlane\", \"predicate\": \"makes arrows point in direction of\", \"object\": \"Z axis\"},\n {\"subject\": \"Align to Object\", \"predicate\": \"relates gumball to\", \"object\": \"U curve\"},\n {\"subject\": \"Align to Object\", \"predicate\": \"relates gumball to\", \"object\": \"V curve\"},\n {\"subject\": \"Align to World\", \"predicate\": \"relates gumball to\", \"object\": \"World grid system\"},\n {\"subject\": \"Smooth dragging\", \"predicate\": 
\"allows dragging without\", \"object\": \"Restraints\"},\n {\"subject\": \"Snappy dragging\", \"predicate\": \"is equivalent to\", \"object\": \"Osnaps\"},\n {\"subject\": \"Snappy dragging\", \"predicate\": \"causes object to snap to\", \"object\": \"End point\"},\n {\"subject\": \"Snappy dragging\", \"predicate\": \"causes object to snap to\", \"object\": \"Mid point\"},\n {\"subject\": \"Snappy dragging\", \"predicate\": \"causes object to snap to\", \"objec",
"error": null
},
"condition_b": {
"local_latency_s": 29.12,
"local_metadata": {
"language": "en",
"char_length": 450,
"primary_format": "prose",
"structural_signals": {
"has_headings": false,
"has_bullet_lists": false,
"has_numbered_lists": false,
"has_tables": false,
"has_code_blocks": false,
"has_dates": false
},
"content_signals": {
"has_named_people": false,
"has_institutional_language": true,
"has_technical_terminology": true,
"has_first_person": false,
"has_quotations": false
},
"domain_class": "technical",
"one_sentence_summary": "Instructional text about the functionality and usage of a gumball tool in a 3D modeling software"
},
"local_raw": "{\n \"language\": \"en\",\n \"char_length\": 450,\n \"primary_format\": \"prose\",\n \"structural_signals\": {\n \"has_headings\": false,\n \"has_bullet_lists\": false,\n \"has_numbered_lists\": false,\n \"has_tables\": false,\n \"has_code_blocks\": false,\n \"has_dates\": false\n },\n \"content_signals\": {\n \"has_named_people\": false,\n \"has_institutional_language\": true,\n \"has_technical_terminology\": true,\n \"has_first_person\": false,\n \"has_quotations\": false\n },\n \"domain_class\": \"technical\",\n \"one_sentence_summary\": \"Instructional text about the functionality and usage of a gumball tool in a 3D modeling software\"\n}",
"api_input_tokens": 918,
"api_output_tokens": 1816,
"api_latency_s": 8.19,
"metrics": {
"n_entities": 50,
"n_edges": 35,
"predicate_diversity": 21,
"type_diversity": 25,
"avg_degree": 1.4,
"largest_component": 8,
"largest_component_pct": 16.0
},
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Gumball\", \"type\": \"Tool\"},\n {\"name\": \"widget\", \"type\": \"Component\"},\n {\"name\": \"selected object\", \"type\": \"Entity\"},\n {\"name\": \"direct editing\", \"type\": \"Function\"},\n {\"name\": \"move transformation\", \"type\": \"Transformation\"},\n {\"name\": \"scale transformation\", \"type\": \"Transformation\"},\n {\"name\": \"rotate transformation\", \"type\": \"Transformation\"},\n {\"name\": \"gumball origin\", \"type\": \"Reference Point\"},\n {\"name\": \"Gumball Menu\", \"type\": \"Interface\"},\n {\"name\": \"white menu ball\", \"type\": \"UI Element\"},\n {\"name\": \"Relocate Gumball\", \"type\": \"Menu Option\"},\n {\"name\": \"gumball center\", \"type\": \"Reference Point\"},\n {\"name\": \"origin point\", \"type\": \"Reference Point\"},\n {\"name\": \"scaling\", \"type\": \"Operation\"},\n {\"name\": \"translating\", \"type\": \"Operation\"},\n {\"name\": \"rotating\", \"type\": \"Operation\"},\n {\"name\": \"Reset Gumball\", \"type\": \"Menu Option\"},\n {\"name\": \"area centroid\", \"type\": \"Reference Point\"},\n {\"name\": \"object orientation\", \"type\": \"Property\"},\n {\"name\": \"Align to CPlane\", \"type\": \"Menu Option\"},\n {\"name\": \"Align to Object\", \"type\": \"Menu Option\"},\n {\"name\": \"Align to World\", \"type\": \"Menu Option\"},\n {\"name\": \"CPlane\", \"type\": \"Reference System\"},\n {\"name\": \"arrows\", \"type\": \"UI Element\"},\n {\"name\": \"x axis\", \"type\": \"Axis\"},\n {\"name\": \"y axis\", \"type\": \"Axis\"},\n {\"name\": \"z axis\", \"type\": \"Axis\"},\n {\"name\": \"U curves\", \"type\": \"Object Property\"},\n {\"name\": \"V curves\", \"type\": \"Object Property\"},\n {\"name\": \"world grid system\", \"type\": \"Reference System\"},\n {\"name\": \"Snappy Dragging\", \"type\": \"Mode\"},\n {\"name\": \"Smooth Dragging\", \"type\": \"Mode\"},\n {\"name\": \"Osnaps\", \"type\": \"Feature\"},\n {\"name\": \"snap point\", 
\"type\": \"Reference Point\"},\n {\"name\": \"end snap\", \"type\": \"Snap Type\"},\n {\"name\": \"mid snap\", \"type\": \"Snap Type\"},\n {\"name\": \"center snap\", \"type\": \"Snap Type\"},\n {\"name\": \"Drag Strength\", \"type\": \"Parameter\"},\n {\"name\": \"mouse movement\", \"type\": \"Input\"},\n {\"name\": \"1:1 ratio\", \"type\": \"Ratio\"},\n {\"name\": \"Gumball Elements\", \"type\": \"Component Group\"},\n {\"name\": \"color\", \"type\": \"Property\"},\n {\"name\": \"axis assignment\", \"type\": \"Mapping\"},\n {\"name\": \"Red\", \"type\": \"Color\"},\n {\"name\": \"Green\", \"type\": \"Color\"},\n {\"name\": \"Blue\", \"type\": \"Color\"},\n {\"name\": \"Z plane\", \"type\": \"Dimension\"},\n {\"name\": \"Z dimension\", \"type\": \"Dimension\"},\n {\"name\": \"dotted lines\", \"type\": \"UI Element\"},\n {\"name\": \"Gumball Short-Hands and Tricks\", \"type\": \"Section\"}\n ],\n \"edges\": [\n {\"subject\": \"Gumball\", \"predicate\": \"is a\", \"object\": \"widget\"},\n {\"subject\": \"Gumball\", \"predicate\": \"is used for\", \"object\": \"direct editing\"},\n {\"subject\": \"Gumball\", \"predicate\": \"operates on\", \"object\": \"selected object\"},\n {\"subject\": \"Gumball\", \"predicate\": \"provides\", \"object\": \"move transformation\"},\n {\"subject\": \"Gumball\", \"predicate\": \"provides\", \"object\": \"scale transformation\"},\n {\"subject\": \"Gumball\", \"predicate\": \"provides\", \"object\": \"rotate transformation\"},\n {\"subject\": \"move transformation\", \"predicate\": \"occurs around\", \"object\": \"gumball origin\"},\n {\"subject\": \"scale transformation\", \"predicate\": \"occurs around\", \"object\": \"gumball origin\"},\n {\"subject\": \"rotate transformation\", \"predicate\": \"occurs around\", \"object\": \"gumball origin\"},\n {\"subject\": \"white menu ball\", \"predicate\": \"opens\", \"object\": \"Gumball Menu\"},\n {\"subject\": \"Relocate Gumball\", \"predicate\": \"allows redefinition of\", \"object\": 
\"gumball center\"},\n {\"subject\": \"gumball center\", \"predicate\": \"serves as\", \"object\": \"origin point\"},\n {\"subject\": \"origin point\", \"predicate\": \"is reference for\", \"object\": \"scaling\"},\n {\"subject\": \"origin point\", \"predicate\": \"is reference for\", \"object\": \"translating\"},\n {\"subject\": \"origin point\", \"predicate\": \"is reference for\", \"object\": \"rotating\"},\n {\"subject\": \"Reset Gumball\", \"predicate\": \"places gumball a",
"error": null
},
"_type": "course_module"
},
{
"source": "00_Syllabus.docx",
"size_bucket": "medium",
"doc_chars_original": 2273,
"doc_chars_sent": 2273,
"truncated": false,
"condition_a": {
"input_tokens": 607,
"output_tokens": 1199,
"latency_s": 5.55,
"metrics": {
"n_entities": 35,
"n_edges": 25,
"predicate_diversity": 15,
"type_diversity": 15,
"avg_degree": 1.43,
"largest_component": 17,
"largest_component_pct": 48.6
},
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"3D Computer Aided Design drawing\", \"type\": \"subject\"},\n {\"name\": \"rapid manufacturing\", \"type\": \"subject\"},\n {\"name\": \"Rhino3D\", \"type\": \"software\"},\n {\"name\": \"students\", \"type\": \"agent\"},\n {\"name\": \"virtual spaces\", \"type\": \"concept\"},\n {\"name\": \"CAD software\", \"type\": \"software\"},\n {\"name\": \"NURBS geometry\", \"type\": \"concept\"},\n {\"name\": \"3D printing\", \"type\": \"process\"},\n {\"name\": \"digital fabrication software\", \"type\": \"software\"},\n {\"name\": \"digital fabrication hardware\", \"type\": \"hardware\"},\n {\"name\": \"spatial awareness\", \"type\": \"skill\"},\n {\"name\": \"scales\", \"type\": \"concept\"},\n {\"name\": \"measuring devices\", \"type\": \"tool\"},\n {\"name\": \"physical dimensions\", \"type\": \"concept\"},\n {\"name\": \"virtual space\", \"type\": \"concept\"},\n {\"name\": \"three dimensional form\", \"type\": \"concept\"},\n {\"name\": \"digital modeling\", \"type\": \"process\"},\n {\"name\": \"three dimensional design principals\", \"type\": \"concept\"},\n {\"name\": \"Desktop FDM printing\", \"type\": \"process\"},\n {\"name\": \"printing process\", \"type\": \"process\"},\n {\"name\": \"tolerances\", \"type\": \"concept\"},\n {\"name\": \"critical thinking\", \"type\": \"skill\"},\n {\"name\": \"problem solving skills\", \"type\": \"skill\"},\n {\"name\": \"research methods\", \"type\": \"methodology\"},\n {\"name\": \"assignments\", \"type\": \"task\"},\n {\"name\": \"Rhino Level 1 Training Guide\", \"type\": \"resource\"},\n {\"name\": \"grading system\", \"type\": \"system\"},\n {\"name\": \"attendance\", \"type\": \"requirement\"},\n {\"name\": \"grade\", \"type\": \"metric\"},\n {\"name\": \"planning\", \"type\": \"skill\"},\n {\"name\": \"project completion\", \"type\": \"task\"},\n {\"name\": \"design elements\", \"type\": \"concept\"},\n {\"name\": \"design principles\", \"type\": 
\"concept\"},\n {\"name\": \"original thinking\", \"type\": \"skill\"},\n {\"name\": \"course material\", \"type\": \"content\"}\n ],\n \"edges\": [\n {\"subject\": \"course\", \"predicate\": \"introduces\", \"object\": \"3D Computer Aided Design drawing\"},\n {\"subject\": \"course\", \"predicate\": \"introduces\", \"object\": \"rapid manufacturing\"},\n {\"subject\": \"course\", \"predicate\": \"uses\", \"object\": \"Rhino3D\"},\n {\"subject\": \"students\", \"predicate\": \"become acquainted with\", \"object\": \"virtual spaces\"},\n {\"subject\": \"students\", \"predicate\": \"become acquainted with\", \"object\": \"NURBS geometry\"},\n {\"subject\": \"students\", \"predicate\": \"gain hands on technical skills with\", \"object\": \"digital fabrication software\"},\n {\"subject\": \"students\", \"predicate\": \"gain hands on technical skills with\", \"object\": \"digital fabrication hardware\"},\n {\"subject\": \"students\", \"predicate\": \"demonstrate\", \"object\": \"spatial awareness\"},\n {\"subject\": \"students\", \"predicate\": \"use\", \"object\": \"scales\"},\n {\"subject\": \"students\", \"predicate\": \"use\", \"object\": \"measuring devices\"},\n {\"subject\": \"students\", \"predicate\": \"translate\", \"object\": \"physical dimensions into virtual space\"},\n {\"subject\": \"students\", \"predicate\": \"gain working knowledge of\", \"object\": \"three dimensional form\"},\n {\"subject\": \"students\", \"predicate\": \"apply\", \"object\": \"three dimensional design principals\"},\n {\"subject\": \"students\", \"predicate\": \"understand\", \"object\": \"Desktop FDM printing\"},\n {\"subject\": \"students\", \"predicate\": \"understand\", \"object\": \"printing process\"},\n {\"subject\": \"students\", \"predicate\": \"understand and apply\", \"object\": \"tolerances\"},\n {\"subject\": \"students\", \"predicate\": \"use\", \"object\": \"critical thinking\"},\n {\"subject\": \"students\", \"predicate\": \"use\", \"object\": \"problem solving 
skills\"},\n {\"subject\": \"students\", \"predicate\": \"use\", \"object\": \"research methods\"},\n {\"subject\": \"NURBS geometry\", \"predicate\": \"enables output of\", \"object\": \"3D printing\"},\n {\"subject\": \"late assignment\", \"predicate\": \"results in grade reduction of\", \"object\": \"10%\"},\n {\"subject\": \"grading system\", \"predicate\": \"evaluates\", \"object\": \"planning\"},\n {\"subject\": \"grading system\", \"predicate\": \"evaluates\", \"obj",
"error": null
},
"condition_b": {
"local_latency_s": 28.53,
"local_metadata": {
"language": "en",
"char_length": 320,
"primary_format": "prose",
"structural_signals": {
"has_headings": true,
"has_bullet_lists": false,
"has_numbered_lists": false,
"has_tables": false,
"has_code_blocks": false,
"has_dates": false
},
"content_signals": {
"has_named_people": false,
"has_institutional_language": true,
"has_technical_terminology": true,
"has_first_person": false,
"has_quotations": false
},
"domain_class": "educational",
"one_sentence_summary": "This document outlines a course on 3D Computer Aided Design and rapid manufacturing using Rhino3D software."
},
"local_raw": "{\n \"language\": \"en\",\n \"char_length\": 320,\n \"primary_format\": \"prose\",\n \"structural_signals\": {\n \"has_headings\": true,\n \"has_bullet_lists\": false,\n \"has_numbered_lists\": false,\n \"has_tables\": false,\n \"has_code_blocks\": false,\n \"has_dates\": false\n },\n \"content_signals\": {\n \"has_named_people\": false,\n \"has_institutional_language\": true,\n \"has_technical_terminology\": true,\n \"has_first_person\": false,\n \"has_quotations\": false\n },\n \"domain_class\": \"educational\",\n \"one_sentence_summary\": \"This document outlines a course on 3D Computer Aided Design and rapid manufacturing using Rhino3D software.\"\n}",
"api_input_tokens": 875,
"api_output_tokens": 1493,
"api_latency_s": 6.75,
"metrics": {
"n_entities": 34,
"n_edges": 38,
"predicate_diversity": 19,
"type_diversity": 16,
"avg_degree": 2.24,
"largest_component": 3,
"largest_component_pct": 8.8
},
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"3D Computer Aided Design\", \"type\": \"subject\"},\n {\"name\": \"rapid manufacturing\", \"type\": \"subject\"},\n {\"name\": \"Rhino3D\", \"type\": \"software\"},\n {\"name\": \"CAD software\", \"type\": \"software_category\"},\n {\"name\": \"NURBS geometry\", \"type\": \"technical_concept\"},\n {\"name\": \"3D printing\", \"type\": \"technology\"},\n {\"name\": \"digital fabrication software\", \"type\": \"software_category\"},\n {\"name\": \"digital fabrication hardware\", \"type\": \"hardware_category\"},\n {\"name\": \"spatial awareness\", \"type\": \"skill\"},\n {\"name\": \"scales\", \"type\": \"concept\"},\n {\"name\": \"measuring devices\", \"type\": \"tool\"},\n {\"name\": \"formula\", \"type\": \"concept\"},\n {\"name\": \"physical dimensions\", \"type\": \"concept\"},\n {\"name\": \"virtual space\", \"type\": \"concept\"},\n {\"name\": \"three dimensional form\", \"type\": \"concept\"},\n {\"name\": \"digital modeling\", \"type\": \"skill\"},\n {\"name\": \"three dimensional design principals\", \"type\": \"concept\"},\n {\"name\": \"Desktop FDM printing\", \"type\": \"technology\"},\n {\"name\": \"tolerances\", \"type\": \"concept\"},\n {\"name\": \"critical thinking\", \"type\": \"skill\"},\n {\"name\": \"problem solving skills\", \"type\": \"skill\"},\n {\"name\": \"research methods\", \"type\": \"methodology\"},\n {\"name\": \"Rhino Level 1 Training Guide\", \"type\": \"learning_material\"},\n {\"name\": \"assignments\", \"type\": \"assessment\"},\n {\"name\": \"homework\", \"type\": \"assessment\"},\n {\"name\": \"grading\", \"type\": \"assessment_method\"},\n {\"name\": \"attendance\", \"type\": \"assessment_criterion\"},\n {\"name\": \"grade A\", \"type\": \"grade\"},\n {\"name\": \"grade B\", \"type\": \"grade\"},\n {\"name\": \"grade C\", \"type\": \"grade\"},\n {\"name\": \"grade D\", \"type\": \"grade\"},\n {\"name\": \"grade F\", \"type\": \"grade\"},\n {\"name\": \"project\", 
\"type\": \"deliverable\"},\n {\"name\": \"elements and principles of design\", \"type\": \"concept\"}\n ],\n \"edges\": [\n {\"subject\": \"course\", \"predicate\": \"introduces\", \"object\": \"3D Computer Aided Design\"},\n {\"subject\": \"course\", \"predicate\": \"introduces\", \"object\": \"rapid manufacturing\"},\n {\"subject\": \"course\", \"predicate\": \"uses\", \"object\": \"Rhino3D\"},\n {\"subject\": \"students\", \"predicate\": \"become acquainted with\", \"object\": \"virtual space\"},\n {\"subject\": \"students\", \"predicate\": \"become acquainted with\", \"object\": \"NURBS geometry\"},\n {\"subject\": \"students\", \"predicate\": \"gain skills in\", \"object\": \"digital fabrication software\"},\n {\"subject\": \"students\", \"predicate\": \"gain skills in\", \"object\": \"digital fabrication hardware\"},\n {\"subject\": \"students\", \"predicate\": \"demonstrate\", \"object\": \"spatial awareness\"},\n {\"subject\": \"students\", \"predicate\": \"use\", \"object\": \"scales\"},\n {\"subject\": \"students\", \"predicate\": \"use\", \"object\": \"measuring devices\"},\n {\"subject\": \"students\", \"predicate\": \"apply\", \"object\": \"formula\"},\n {\"subject\": \"formula\", \"predicate\": \"translates\", \"object\": \"physical dimensions\"},\n {\"subject\": \"physical dimensions\", \"predicate\": \"converted to\", \"object\": \"virtual space\"},\n {\"subject\": \"students\", \"predicate\": \"gain knowledge of\", \"object\": \"three dimensional form\"},\n {\"subject\": \"students\", \"predicate\": \"gain knowledge of\", \"object\": \"digital modeling\"},\n {\"subject\": \"students\", \"predicate\": \"apply\", \"object\": \"three dimensional design principals\"},\n {\"subject\": \"students\", \"predicate\": \"understand\", \"object\": \"Desktop FDM printing\"},\n {\"subject\": \"students\", \"predicate\": \"understand\", \"object\": \"printing process\"},\n {\"subject\": \"students\", \"predicate\": \"apply\", \"object\": \"tolerances\"},\n 
{\"subject\": \"students\", \"predicate\": \"prepare files for\", \"object\": \"output\"},\n {\"subject\": \"students\", \"predicate\": \"use\", \"object\": \"critical thinking\"},\n {\"subject\": \"students\", \"predicate\": \"use\", \"object\": \"problem solving skills\"},\n {\"subject\": \"students\", \"predicate\": \"use\", \"object\": \"research methods\"},\n {\"subject\": \"course\", \"predicate\": \"requires\", \"object\": \"Rhino Level 1 Training Guide\"},\n {\"s",
"error": null
},
"_type": "syllabus"
},
{
"source": "1119345.pdf",
"size_bucket": "medium",
"doc_chars_original": 1745,
"doc_chars_sent": 1745,
"truncated": false,
"condition_a": {
"input_tokens": 807,
"output_tokens": 1434,
"latency_s": 6.86,
"metrics": {
"n_entities": 33,
"n_edges": 21,
"predicate_diversity": 21,
"type_diversity": 25,
"avg_degree": 1.27,
"largest_component": 15,
"largest_component_pct": 45.5
},
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Stratasys\", \"type\": \"Company\"},\n {\"name\": \"7665 Commerce Way Eden Prairie, MN 55344\", \"type\": \"Address\"},\n {\"name\": \"1.800.801.6491\", \"type\": \"Phone Number\"},\n {\"name\": \"http://www.stratasys.com\", \"type\": \"Website\"},\n {\"name\": \"Invoice 1119345\", \"type\": \"Invoice\"},\n {\"name\": \"01-JUN-16\", \"type\": \"Date\"},\n {\"name\": \"SUNY New Paltz\", \"type\": \"Organization\"},\n {\"name\": \"Smiley Arts Building 100 1 Hawk Dr New Paltz, NY, 12561-2447\", \"type\": \"Address\"},\n {\"name\": \"Stratasys US OU\", \"type\": \"Organization\"},\n {\"name\": \"AARON NELSON\", \"type\": \"Person\"},\n {\"name\": \"1109911\", \"type\": \"Order Number\"},\n {\"name\": \"CC-AARON NELSON-9121\", \"type\": \"Purchase Order\"},\n {\"name\": \"USD\", \"type\": \"Currency\"},\n {\"name\": \"08-JUN-16\", \"type\": \"Date\"},\n {\"name\": \"8326568\", \"type\": \"Internal ID\"},\n {\"name\": \"OBJ-03327\", \"type\": \"Part Number\"},\n {\"name\": \"PACK OF 1 RGD837, VERO PUREWHITE, 3.6KG\", \"type\": \"Product\"},\n {\"name\": \"13406-03327\", \"type\": \"Serial/Lot Number\"},\n {\"name\": \"15-JUN-17\", \"type\": \"Expiration Date\"},\n {\"name\": \"521.00\", \"type\": \"Price\"},\n {\"name\": \"3888472\", \"type\": \"Delivery Number\"},\n {\"name\": \"UPS Ground\", \"type\": \"Shipping Method\"},\n {\"name\": \"accounts.receivable@stratasys.com\", \"type\": \"Email Address\"},\n {\"name\": \"Stratasys, Inc\", \"type\": \"Company\"},\n {\"name\": \"28043 Network Place Chicago, IL 60673-1280\", \"type\": \"Address\"},\n {\"name\": \"JP Morgan Chase Bank\", \"type\": \"Bank\"},\n {\"name\": \"021000021\", \"type\": \"Routing Number\"},\n {\"name\": \"601551695\", \"type\": \"Account Number\"},\n {\"name\": \"CHASUS33\", \"type\": \"SWIFT Code\"},\n {\"name\": \"GB02CHAS60924241287679\", \"type\": \"IBAN\"},\n {\"name\": \"CHASGB2L\", \"type\": \"SWIFT Code\"},\n {\"name\": 
\"41287679\", \"type\": \"Account Number\"},\n {\"name\": \"124001545\", \"type\": \"Routing Number\"}\n ],\n \"edges\": [\n {\"subject\": \"Stratasys\", \"predicate\": \"is located at\", \"object\": \"7665 Commerce Way Eden Prairie, MN 55344\"},\n {\"subject\": \"Stratasys\", \"predicate\": \"has phone number\", \"object\": \"1.800.801.6491\"},\n {\"subject\": \"Stratasys\", \"predicate\": \"has website\", \"object\": \"http://www.stratasys.com\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"issued on\", \"object\": \"01-JUN-16\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"billed to\", \"object\": \"SUNY New Paltz\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"shipped to\", \"object\": \"Smiley Arts Building 100 1 Hawk Dr New Paltz, NY, 12561-2447\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"issued by\", \"object\": \"Stratasys US OU\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"references purchase order\", \"object\": \"CC-AARON NELSON-9121\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"due on\", \"object\": \"08-JUN-16\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"contains line item\", \"object\": \"OBJ-03327\"},\n {\"subject\": \"OBJ-03327\", \"predicate\": \"describes product\", \"object\": \"PACK OF 1 RGD837, VERO PUREWHITE, 3.6KG\"},\n {\"subject\": \"OBJ-03327\", \"predicate\": \"has serial lot number\", \"object\": \"13406-03327\"},\n {\"subject\": \"OBJ-03327\", \"predicate\": \"has expiration date\", \"object\": \"15-JUN-17\"},\n {\"subject\": \"OBJ-03327\", \"predicate\": \"has unit price\", \"object\": \"521.00\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"total amount\", \"object\": \"521.00\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"shipped via\", \"object\": \"UPS Ground\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"has delivery number\", \"object\": \"3888472\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"shipped on\", 
\"object\": \"01-JUN-16\"},\n {\"subject\": \"AARON NELSON\", \"predicate\": \"associated with\", \"object\": \"CC-AARON NELSON-9121\"},\n {\"subject\": \"Stratasys, Inc\", \"predicate\": \"located at\", \"object\": \"28043 Network Place Chicago, IL 60673-1280\"},\n {\"subject\": \"Stratasys\", \"predicate\": \"accepts payment via\", \"object\": \"JP Morgan Chase Bank\"}\n ]\n}\n```",
"error": null
},
"condition_b": {
"local_latency_s": 39.42,
"local_metadata": {
"language": "en",
"char_length": 780,
"primary_format": "prose",
"structural_signals": {
"has_headings": false,
"has_bullet_lists": false,
"has_numbered_lists": false,
"has_tables": false,
"has_code_blocks": false,
"has_dates": true
},
"content_signals": {
"has_named_people": false,
"has_institutional_language": true,
"has_technical_terminology": true,
"has_first_person": false,
"has_quotations": false
},
"domain_class": "administrative",
"one_sentence_summary": "Invoice for a purchase order from Stratasys US OU to SUNY New Paltz for a product named OBJ-03327"
},
"local_raw": "{\n \"language\": \"en\",\n \"char_length\": 780,\n \"primary_format\": \"prose\",\n \"structural_signals\": {\n \"has_headings\": false,\n \"has_bullet_lists\": false,\n \"has_numbered_lists\": false,\n \"has_tables\": false,\n \"has_code_blocks\": false,\n \"has_dates\": true\n },\n \"content_signals\": {\n \"has_named_people\": false,\n \"has_institutional_language\": true,\n \"has_technical_terminology\": true,\n \"has_first_person\": false,\n \"has_quotations\": false\n },\n \"domain_class\": \"administrative\",\n \"one_sentence_summary\": \"Invoice for a purchase order from Stratasys US OU to SUNY New Paltz for a product named OBJ-03327\"\n}",
"api_input_tokens": 1082,
"api_output_tokens": 1626,
"api_latency_s": 7.53,
"metrics": {
"n_entities": 29,
"n_edges": 31,
"predicate_diversity": 22,
"type_diversity": 17,
"avg_degree": 2.14,
"largest_component": 20,
"largest_component_pct": 69.0
},
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Stratasys US OU\", \"type\": \"Organization\"},\n {\"name\": \"SUNY New Paltz\", \"type\": \"Organization\"},\n {\"name\": \"Invoice 1119345\", \"type\": \"Document\"},\n {\"name\": \"Purchase Order 1109911\", \"type\": \"Document\"},\n {\"name\": \"OBJ-03327\", \"type\": \"Product\"},\n {\"name\": \"PACK OF 1 RGD837, VERO PUREWHITE, 3.6KG\", \"type\": \"Product Description\"},\n {\"name\": \"Smiley Arts Building 100\", \"type\": \"Location\"},\n {\"name\": \"1 Hawk Dr\", \"type\": \"Address\"},\n {\"name\": \"New Paltz, NY, 12561-2447\", \"type\": \"Location\"},\n {\"name\": \"United States\", \"type\": \"Country\"},\n {\"name\": \"7665 Commerce Way\", \"type\": \"Address\"},\n {\"name\": \"Eden Prairie, MN 55344\", \"type\": \"Location\"},\n {\"name\": \"01-JUN-16\", \"type\": \"Date\"},\n {\"name\": \"08-JUN-16\", \"type\": \"Date\"},\n {\"name\": \"15-JUN-17\", \"type\": \"Date\"},\n {\"name\": \"USD\", \"type\": \"Currency\"},\n {\"name\": \"CC\", \"type\": \"Payment Terms\"},\n {\"name\": \"UPS Ground\", \"type\": \"Shipping Method\"},\n {\"name\": \"Delivery Number 3888472\", \"type\": \"Shipment\"},\n {\"name\": \"Serial/Lot 13406-03327\", \"type\": \"Identifier\"},\n {\"name\": \"accounts.receivable@stratasys.com\", \"type\": \"Email Address\"},\n {\"name\": \"Stratasys, Inc\", \"type\": \"Organization\"},\n {\"name\": \"28043 Network Place\", \"type\": \"Address\"},\n {\"name\": \"Chicago, IL 60673-1280\", \"type\": \"Location\"},\n {\"name\": \"JP Morgan Chase Bank\", \"type\": \"Financial Institution\"},\n {\"name\": \"CHASUS33\", \"type\": \"SWIFT Code\"},\n {\"name\": \"CHASGB2L\", \"type\": \"SWIFT Code\"},\n {\"name\": \"CC-AARON NELSON-9121\", \"type\": \"Account Reference\"},\n {\"name\": \"Internal ID 8326568\", \"type\": \"Identifier\"}\n ],\n \"edges\": [\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"issued by\", \"object\": \"Stratasys US OU\"},\n {\"subject\": \"Invoice 
1119345\", \"predicate\": \"billed to\", \"object\": \"SUNY New Paltz\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"issued on\", \"object\": \"01-JUN-16\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"references\", \"object\": \"Purchase Order 1109911\"},\n {\"subject\": \"Purchase Order 1109911\", \"predicate\": \"placed by\", \"object\": \"SUNY New Paltz\"},\n {\"subject\": \"Purchase Order 1109911\", \"predicate\": \"placed with\", \"object\": \"Stratasys US OU\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"contains line item\", \"object\": \"OBJ-03327\"},\n {\"subject\": \"OBJ-03327\", \"predicate\": \"has description\", \"object\": \"PACK OF 1 RGD837, VERO PUREWHITE, 3.6KG\"},\n {\"subject\": \"OBJ-03327\", \"predicate\": \"has unit price\", \"object\": \"USD 521.00\"},\n {\"subject\": \"OBJ-03327\", \"predicate\": \"has serial lot\", \"object\": \"Serial/Lot 13406-03327\"},\n {\"subject\": \"Serial/Lot 13406-03327\", \"predicate\": \"expires on\", \"object\": \"15-JUN-17\"},\n {\"subject\": \"SUNY New Paltz\", \"predicate\": \"located at\", \"object\": \"Smiley Arts Building 100\"},\n {\"subject\": \"Smiley Arts Building 100\", \"predicate\": \"located at\", \"object\": \"1 Hawk Dr\"},\n {\"subject\": \"1 Hawk Dr\", \"predicate\": \"located in\", \"object\": \"New Paltz, NY, 12561-2447\"},\n {\"subject\": \"New Paltz, NY, 12561-2447\", \"predicate\": \"located in\", \"object\": \"United States\"},\n {\"subject\": \"Stratasys US OU\", \"predicate\": \"located at\", \"object\": \"7665 Commerce Way\"},\n {\"subject\": \"7665 Commerce Way\", \"predicate\": \"located in\", \"object\": \"Eden Prairie, MN 55344\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"due on\", \"object\": \"08-JUN-16\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"payment terms\", \"object\": \"CC\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": \"currency\", \"object\": \"USD\"},\n {\"subject\": \"Invoice 1119345\", \"predicate\": 
\"total amount\", \"object\": \"521.00 USD\"},\n {\"subject\": \"Delivery Number 3888472\", \"predicate\": \"shipped on\", \"object\": \"01-JUN-16\"},\n {\"subject\": \"Delivery Number 3888472\", \"predicate\": \"shipped via\", \"object\": \"UPS Ground\"},\n {\"subject\": \"Delivery Number 3888472\", \"predicate\": \"associated with\", \"object\": \"Invoice 1119345\"},\n {\"subject\": \"Stratasys, Inc\", \"predicate\": \"accepts payment method",
"error": null
},
"_type": "other"
},
{
"source": "06_PPT_ALL_Design Considerations_From CAD to CAM_v3.pptx",
"size_bucket": "large",
"doc_chars_original": 21383,
"doc_chars_sent": 12000,
"truncated": true,
"condition_a": {
"input_tokens": 2951,
"output_tokens": 2424,
"latency_s": 13.83,
"metrics": {
"n_entities": 55,
"n_edges": 54,
"predicate_diversity": 33,
"type_diversity": 26,
"avg_degree": 1.96,
"largest_component": 13,
"largest_component_pct": 23.6
},
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"3D Printing\", \"type\": \"Technology\"},\n {\"name\": \"CAD\", \"type\": \"Software\"},\n {\"name\": \"CAM\", \"type\": \"Software\"},\n {\"name\": \"Slicing\", \"type\": \"Process\"},\n {\"name\": \"Support\", \"type\": \"Design Element\"},\n {\"name\": \"Part Density\", \"type\": \"Design Parameter\"},\n {\"name\": \"Wall Thickness\", \"type\": \"Design Parameter\"},\n {\"name\": \"Tolerance\", \"type\": \"Design Parameter\"},\n {\"name\": \"Support Removal\", \"type\": \"Process\"},\n {\"name\": \"Infill\", \"type\": \"Design Parameter\"},\n {\"name\": \"Mesh\", \"type\": \"Data Structure\"},\n {\"name\": \"STL\", \"type\": \"File Format\"},\n {\"name\": \"ASCII\", \"type\": \"File Format\"},\n {\"name\": \"Binary\", \"type\": \"File Format\"},\n {\"name\": \"FDM\", \"type\": \"3D Printing Technology\"},\n {\"name\": \"PolyJet\", \"type\": \"3D Printing Technology\"},\n {\"name\": \"J750\", \"type\": \"3D Printer\"},\n {\"name\": \"Texture Mapping\", \"type\": \"Design Technique\"},\n {\"name\": \"Nozzle Diameter\", \"type\": \"Machine Parameter\"},\n {\"name\": \"Droplet Size\", \"type\": \"Machine Parameter\"},\n {\"name\": \"Material Rheology\", \"type\": \"Material Property\"},\n {\"name\": \"Clearance\", \"type\": \"Design Parameter\"},\n {\"name\": \"XY Clearance\", \"type\": \"Design Parameter\"},\n {\"name\": \"Z Clearance\", \"type\": \"Design Parameter\"},\n {\"name\": \"Accuracy\", \"type\": \"Measurement Property\"},\n {\"name\": \"Precision\", \"type\": \"Measurement Property\"},\n {\"name\": \"Resolution\", \"type\": \"Machine Property\"},\n {\"name\": \"Repeatability\", \"type\": \"Machine Property\"},\n {\"name\": \"Rheology\", \"type\": \"Material Science\"},\n {\"name\": \"Photopolymer\", \"type\": \"Material\"},\n {\"name\": \"Thermoplastic\", \"type\": \"Material\"},\n {\"name\": \"Temperature Gradients\", \"type\": \"Environmental Factor\"},\n {\"name\": \"Vibrations\", \"type\": 
\"Environmental Factor\"},\n {\"name\": \"Deflections\", \"type\": \"Environmental Factor\"},\n {\"name\": \"Fortus 450 MC\", \"type\": \"3D Printer\"},\n {\"name\": \"F123\", \"type\": \"3D Printer\"},\n {\"name\": \"Stratasys PolyJet\", \"type\": \"3D Printer\"},\n {\"name\": \"Layer Thickness\", \"type\": \"Design Parameter\"},\n {\"name\": \"Minimum Wall Thickness\", \"type\": \"Design Parameter\"},\n {\"name\": \"Minimum Feature Size\", \"type\": \"Design Parameter\"},\n {\"name\": \"Breakaway Support\", \"type\": \"Support Type\"},\n {\"name\": \"Soluble Support\", \"type\": \"Support Type\"},\n {\"name\": \"Post Processing\", \"type\": \"Process\"},\n {\"name\": \"File Conversion\", \"type\": \"Process\"},\n {\"name\": \"Overhangs\", \"type\": \"Design Challenge\"},\n {\"name\": \"Closed Systems\", \"type\": \"Design Type\"},\n {\"name\": \"GrabCAD\", \"type\": \"Design Repository\"},\n {\"name\": \"Euclidean Space\", \"type\": \"Mathematical Concept\"},\n {\"name\": \"Point\", \"type\": \"Geometric Element\"},\n {\"name\": \"Triangle\", \"type\": \"Geometric Element\"},\n {\"name\": \"Curved Surfaces\", \"type\": \"Geometric Feature\"},\n {\"name\": \"End Use\", \"type\": \"Design Consideration\"},\n {\"name\": \"Optimal Materials\", \"type\": \"Design Consideration\"},\n {\"name\": \"Tools or Technology\", \"type\": \"Design Consideration\"},\n {\"name\": \"Additive Manufacturing\", \"type\": \"Manufacturing Process\"}\n ],\n \"edges\": [\n {\"subject\": \"3D Printing\", \"predicate\": \"uses\", \"object\": \"Slicing\"},\n {\"subject\": \"3D Printing\", \"predicate\": \"requires\", \"object\": \"Support\"},\n {\"subject\": \"3D Printing\", \"predicate\": \"involves\", \"object\": \"Part Density\"},\n {\"subject\": \"CAD\", \"predicate\": \"converts to\", \"object\": \"CAM\"},\n {\"subject\": \"CAD\", \"predicate\": \"exports to\", \"object\": \"STL\"},\n {\"subject\": \"STL\", \"predicate\": \"is a type of\", \"object\": \"Mesh\"},\n {\"subject\": \"STL\", 
\"predicate\": \"can be\", \"object\": \"ASCII\"},\n {\"subject\": \"STL\", \"predicate\": \"can be\", \"object\": \"Binary\"},\n {\"subject\": \"FDM\", \"predicate\": \"has design considerations for\", \"object\": \"Wall Thickness\"},\n {\"subject\": \"PolyJet\", \"predicate\": \"has design considerations for\", \"object\": \"Wall Thickness\"},\n {\"subject\": \"J750\", \"predicate\": \"supports\", \"object\": \"Texture Mapping\"},\n {\"subject\": \"Mesh\", \"p",
"error": null
},
"condition_b": {
"local_latency_s": 129.91,
"local_metadata": {
"language": "en",
"char_length": 2834,
"primary_format": "prose",
"structural_signals": {
"has_headings": true,
"has_bullet_lists": false,
"has_numbered_lists": false,
"has_tables": false,
"has_code_blocks": false,
"has_dates": false
},
"content_signals": {
"has_named_people": false,
"has_institutional_language": true,
"has_technical_terminology": true,
"has_first_person": false,
"has_quotations": false
},
"domain_class": "technical",
"one_sentence_summary": "Document discusses design considerations for 3D printing, including slicing, support, part density, wall thickness, tolerance, clearance, infill, mesh, STL, ASCII and Binary files, and design considerations for FDM & PolyJet."
},
"local_raw": "{\n \"language\": \"en\",\n \"char_length\": 2834,\n \"primary_format\": \"prose\",\n \"structural_signals\": {\n \"has_headings\": true,\n \"has_bullet_lists\": false,\n \"has_numbered_lists\": false,\n \"has_tables\": false,\n \"has_code_blocks\": false,\n \"has_dates\": false\n },\n \"content_signals\": {\n \"has_named_people\": false,\n \"has_institutional_language\": true,\n \"has_technical_terminology\": true,\n \"has_first_person\": false,\n \"has_quotations\": false\n },\n \"domain_class\": \"technical\",\n \"one_sentence_summary\": \"Document discusses design considerations for 3D printing, including slicing, support, part density, wall thickness, tolerance, clearance, infill, mesh, STL, ASCII and Binary files, and design considerations for FDM & PolyJet.\"\n}",
"api_input_tokens": 3249,
"api_output_tokens": 4347,
"api_latency_s": 20.38,
"metrics": {
"n_entities": 118,
"n_edges": 93,
"predicate_diversity": 51,
"type_diversity": 42,
"avg_degree": 1.58,
"largest_component": 21,
"largest_component_pct": 17.8
},
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"3D printing\", \"type\": \"technology\"},\n {\"name\": \"slicing\", \"type\": \"process\"},\n {\"name\": \"support\", \"type\": \"component\"},\n {\"name\": \"part density\", \"type\": \"property\"},\n {\"name\": \"wall thickness\", \"type\": \"property\"},\n {\"name\": \"tolerance\", \"type\": \"property\"},\n {\"name\": \"clearance\", \"type\": \"property\"},\n {\"name\": \"infill\", \"type\": \"component\"},\n {\"name\": \"mesh\", \"type\": \"data_structure\"},\n {\"name\": \"STL\", \"type\": \"file_format\"},\n {\"name\": \"ASCII\", \"type\": \"file_format\"},\n {\"name\": \"Binary\", \"type\": \"file_format\"},\n {\"name\": \"FDM\", \"type\": \"printing_technology\"},\n {\"name\": \"PolyJet\", \"type\": \"printing_technology\"},\n {\"name\": \"CAD\", \"type\": \"software_type\"},\n {\"name\": \"CAM\", \"type\": \"software_type\"},\n {\"name\": \"nozzle diameter\", \"type\": \"property\"},\n {\"name\": \"droplet size\", \"type\": \"property\"},\n {\"name\": \"material rheology\", \"type\": \"property\"},\n {\"name\": \"minimum printable feature size\", \"type\": \"property\"},\n {\"name\": \"minimum usable feature size\", \"type\": \"property\"},\n {\"name\": \"XY clearance\", \"type\": \"property\"},\n {\"name\": \"Z clearance\", \"type\": \"property\"},\n {\"name\": \"layer thickness\", \"type\": \"property\"},\n {\"name\": \"support material\", \"type\": \"material\"},\n {\"name\": \"accuracy\", \"type\": \"property\"},\n {\"name\": \"resolution\", \"type\": \"property\"},\n {\"name\": \"repeatability\", \"type\": \"property\"},\n {\"name\": \"precision\", \"type\": \"property\"},\n {\"name\": \"temperature gradients\", \"type\": \"factor\"},\n {\"name\": \"vibrations\", \"type\": \"factor\"},\n {\"name\": \"deflections\", \"type\": \"factor\"},\n {\"name\": \"photopolymer\", \"type\": \"material\"},\n {\"name\": \"thermoplastic\", \"type\": \"material\"},\n {\"name\": \"polymerization\", 
\"type\": \"process\"},\n {\"name\": \"solidification\", \"type\": \"process\"},\n {\"name\": \"cooling\", \"type\": \"process\"},\n {\"name\": \"UV light\", \"type\": \"technology\"},\n {\"name\": \"machine specifications\", \"type\": \"documentation\"},\n {\"name\": \"Fortus 450 MC\", \"type\": \"printer_model\"},\n {\"name\": \"F123\", \"type\": \"printer_model\"},\n {\"name\": \"Stratasys PolyJet\", \"type\": \"printer_model\"},\n {\"name\": \"J750\", \"type\": \"printer_model\"},\n {\"name\": \"end use\", \"type\": \"design_consideration\"},\n {\"name\": \"post processing\", \"type\": \"process\"},\n {\"name\": \"optimal materials\", \"type\": \"design_consideration\"},\n {\"name\": \"toolpaths\", \"type\": \"output\"},\n {\"name\": \"overhangs\", \"type\": \"structure\"},\n {\"name\": \"closed systems\", \"type\": \"design_type\"},\n {\"name\": \"moving parts\", \"type\": \"component\"},\n {\"name\": \"fusing\", \"type\": \"process\"},\n {\"name\": \"axes of motion\", \"type\": \"property\"},\n {\"name\": \"mesh file\", \"type\": \"file_type\"},\n {\"name\": \"machine accuracy\", \"type\": \"property\"},\n {\"name\": \"control system\", \"type\": \"component\"},\n {\"name\": \"deviations\", \"type\": \"property\"},\n {\"name\": \"mechanical capability\", \"type\": \"property\"},\n {\"name\": \"fluid properties\", \"type\": \"property\"},\n {\"name\": \"micron\", \"type\": \"unit\"},\n {\"name\": \"measurement increments\", \"type\": \"property\"},\n {\"name\": \"nozzle\", \"type\": \"component\"},\n {\"name\": \"motion increments\", \"type\": \"property\"},\n {\"name\": \"material flow\", \"type\": \"process\"},\n {\"name\": \"non-random\", \"type\": \"property\"},\n {\"name\": \"liquids\", \"type\": \"material_state\"},\n {\"name\": \"soft solids\", \"type\": \"material_state\"},\n {\"name\": \"melted thermoplastic\", \"type\": \"material\"},\n {\"name\": \"droplet\", \"type\": \"component\"},\n {\"name\": \"shape change\", \"type\": \"process\"},\n 
{\"name\": \"commercial 3D printers\", \"type\": \"printer_category\"},\n {\"name\": \"hobbyist 3D printers\", \"type\": \"printer_category\"},\n {\"name\": \"material properties\", \"type\": \"property\"},\n {\"name\": \"part quality\", \"type\": \"property\"},\n {\"name\": \"Z-direction\", \"type\": \"axis\"},\n {\"name\": \"X direction\", \"type\": \"axis\"},\n {\"name\": \"Y direction\", \"type\": \"axis\"},\n {\"name\": \"XY plane\", \"type\": \"plane\"},\n {\"name\": \"Z height\", \"type\": \"me",
"error": null
},
"_type": "course_ppt"
},
{
"source": "2020 - DDF 220 - Intro to Computational Media Syllabus.pdf",
"size_bucket": "large",
"doc_chars_original": 18696,
"doc_chars_sent": 12000,
"truncated": true,
"condition_a": {
"input_tokens": 2857,
"output_tokens": 1824,
"latency_s": 8.63,
"metrics": {
"n_entities": 42,
"n_edges": 38,
"predicate_diversity": 20,
"type_diversity": 25,
"avg_degree": 1.81,
"largest_component": 36,
"largest_component_pct": 85.7
},
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Intro to Computational Media\", \"type\": \"Course\"},\n {\"name\": \"DDF 220-01\", \"type\": \"Course Code\"},\n {\"name\": \"Aaron Nelson\", \"type\": \"Instructor\"},\n {\"name\": \"Fall 2020\", \"type\": \"Semester\"},\n {\"name\": \"EIH 104\", \"type\": \"Classroom\"},\n {\"name\": \"HVAMC\", \"type\": \"Building\"},\n {\"name\": \"Processing\", \"type\": \"Programming Language\"},\n {\"name\": \"Java\", \"type\": \"Programming Language\"},\n {\"name\": \"Python\", \"type\": \"Programming Language\"},\n {\"name\": \"Grasshopper\", \"type\": \"Software\"},\n {\"name\": \"Rhino\", \"type\": \"Software\"},\n {\"name\": \"Blackboard\", \"type\": \"Learning Management System\"},\n {\"name\": \"Learning Processing\", \"type\": \"Textbook\"},\n {\"name\": \"Daniel Shiffman\", \"type\": \"Author\"},\n {\"name\": \"Morgan Kaufmann\", \"type\": \"Publisher\"},\n {\"name\": \"Algorithm Aided Design\", \"type\": \"Textbook\"},\n {\"name\": \"Arturo Tedeschi\", \"type\": \"Author\"},\n {\"name\": \"variables\", \"type\": \"Programming Concept\"},\n {\"name\": \"conditionals\", \"type\": \"Programming Concept\"},\n {\"name\": \"loops\", \"type\": \"Programming Concept\"},\n {\"name\": \"iteration\", \"type\": \"Programming Concept\"},\n {\"name\": \"image processing\", \"type\": \"Topic\"},\n {\"name\": \"interactivity\", \"type\": \"Topic\"},\n {\"name\": \"3D algorithmic modeling\", \"type\": \"Topic\"},\n {\"name\": \"Homeworks\", \"type\": \"Assignment\"},\n {\"name\": \"Mid Semester Project\", \"type\": \"Assignment\"},\n {\"name\": \"Final Project\", \"type\": \"Assignment\"},\n {\"name\": \"Participation\", \"type\": \"Grading Component\"},\n {\"name\": \"October 30th\", \"type\": \"Date\"},\n {\"name\": \"Disability Resource Center\", \"type\": \"Institution\"},\n {\"name\": \"Student Union\", \"type\": \"Location\"},\n {\"name\": \"Office of Veteran and Military Services\", \"type\": \"Institution\"},\n 
{\"name\": \"Title IX Office\", \"type\": \"Institution\"},\n {\"name\": \"Sojourner Truth Library\", \"type\": \"Library\"},\n {\"name\": \"SUNY New Paltz\", \"type\": \"University\"},\n {\"name\": \"electronic arts\", \"type\": \"Discipline\"},\n {\"name\": \"visual design\", \"type\": \"Discipline\"},\n {\"name\": \"new media art\", \"type\": \"Discipline\"},\n {\"name\": \"2D Pixel Processing\", \"type\": \"Project Type\"},\n {\"name\": \"3D design\", \"type\": \"Design Type\"},\n {\"name\": \"generative works\", \"type\": \"Art Form\"},\n {\"name\": \"interactive works\", \"type\": \"Art Form\"}\n ],\n \"edges\": [\n {\"subject\": \"Intro to Computational Media\", \"predicate\": \"has course code\", \"object\": \"DDF 220-01\"},\n {\"subject\": \"Intro to Computational Media\", \"predicate\": \"is taught by\", \"object\": \"Aaron Nelson\"},\n {\"subject\": \"Intro to Computational Media\", \"predicate\": \"is offered in\", \"object\": \"Fall 2020\"},\n {\"subject\": \"Intro to Computational Media\", \"predicate\": \"meets in\", \"object\": \"EIH 104\"},\n {\"subject\": \"EIH 104\", \"predicate\": \"is located in\", \"object\": \"HVAMC\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"has office hours\", \"object\": \"M-R 10am \u2013 12pm\"},\n {\"subject\": \"Intro to Computational Media\", \"predicate\": \"teaches\", \"object\": \"Processing\"},\n {\"subject\": \"Processing\", \"predicate\": \"is built on\", \"object\": \"Java\"},\n {\"subject\": \"Intro to Computational Media\", \"predicate\": \"explores\", \"object\": \"Python\"},\n {\"subject\": \"Intro to Computational Media\", \"predicate\": \"explores\", \"object\": \"Grasshopper\"},\n {\"subject\": \"Intro to Computational Media\", \"predicate\": \"teaches\", \"object\": \"Rhino\"},\n {\"subject\": \"Processing\", \"predicate\": \"is used for\", \"object\": \"electronic arts\"},\n {\"subject\": \"Processing\", \"predicate\": \"is used for\", \"object\": \"visual design\"},\n {\"subject\": 
\"Processing\", \"predicate\": \"is used for\", \"object\": \"new media art\"},\n {\"subject\": \"Intro to Computational Media\", \"predicate\": \"covers\", \"object\": \"variables\"},\n {\"subject\": \"Intro to Computational Media\", \"predicate\": \"covers\", \"object\": \"conditionals\"},\n {\"subject\": \"Intro to Computational Media\", \"predicate\": \"covers\", \"object\": \"loops\"},\n {\"subject\": \"Intro to Computational Media\", \"predicate\": \"cover",
"error": null
},
"condition_b": {
"local_latency_s": 117.37,
"local_metadata": {
"language": "en",
"char_length": 3246,
"primary_format": "prose",
"structural_signals": {
"has_headings": true,
"has_bullet_lists": false,
"has_numbered_lists": false,
"has_tables": false,
"has_code_blocks": false,
"has_dates": false
},
"content_signals": {
"has_named_people": true,
"has_institutional_language": true,
"has_technical_terminology": true,
"has_first_person": false,
"has_quotations": false
},
"domain_class": "educational",
"one_sentence_summary": "Introduction to Computational Media course for programming in digital arts with Processing and Rhino"
},
"local_raw": "{\n \"language\": \"en\",\n \"char_length\": 3246,\n \"primary_format\": \"prose\",\n \"structural_signals\": {\n \"has_headings\": true,\n \"has_bullet_lists\": false,\n \"has_numbered_lists\": false,\n \"has_tables\": false,\n \"has_code_blocks\": false,\n \"has_dates\": false\n },\n \"content_signals\": {\n \"has_named_people\": true,\n \"has_institutional_language\": true,\n \"has_technical_terminology\": true,\n \"has_first_person\": false,\n \"has_quotations\": false\n },\n \"domain_class\": \"educational\",\n \"one_sentence_summary\": \"Introduction to Computational Media course for programming in digital arts with Processing and Rhino\"\n}",
"api_input_tokens": 3119,
"api_output_tokens": 3226,
"api_latency_s": 15.53,
"metrics": {
"n_entities": 93,
"n_edges": 51,
"predicate_diversity": 30,
"type_diversity": 24,
"avg_degree": 1.1,
"largest_component": 31,
"largest_component_pct": 33.3
},
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Intro to Computational Media\", \"type\": \"Course\"},\n {\"name\": \"DDF 220-01\", \"type\": \"Course Code\"},\n {\"name\": \"Aaron Nelson\", \"type\": \"Person\"},\n {\"name\": \"EIH 104\", \"type\": \"Location\"},\n {\"name\": \"HVAMC\", \"type\": \"Location\"},\n {\"name\": \"Processing\", \"type\": \"Programming Language\"},\n {\"name\": \"Java\", \"type\": \"Programming Language\"},\n {\"name\": \"Python\", \"type\": \"Programming Language\"},\n {\"name\": \"Grasshopper\", \"type\": \"Software\"},\n {\"name\": \"Rhino\", \"type\": \"Software\"},\n {\"name\": \"Blackboard\", \"type\": \"Learning Management System\"},\n {\"name\": \"Learning Processing\", \"type\": \"Textbook\"},\n {\"name\": \"Daniel Shiffman\", \"type\": \"Person\"},\n {\"name\": \"Morgan Kaufmann\", \"type\": \"Publisher\"},\n {\"name\": \"Algorithm Aided Design\", \"type\": \"Textbook\"},\n {\"name\": \"Arturo Tedeschi\", \"type\": \"Person\"},\n {\"name\": \"variables\", \"type\": \"Programming Concept\"},\n {\"name\": \"conditionals\", \"type\": \"Programming Concept\"},\n {\"name\": \"loops\", \"type\": \"Programming Concept\"},\n {\"name\": \"iteration\", \"type\": \"Programming Concept\"},\n {\"name\": \"image processing\", \"type\": \"Topic\"},\n {\"name\": \"interactivity\", \"type\": \"Topic\"},\n {\"name\": \"3D algorithmic modeling\", \"type\": \"Topic\"},\n {\"name\": \"digital arts\", \"type\": \"Field\"},\n {\"name\": \"visual design\", \"type\": \"Field\"},\n {\"name\": \"new media art\", \"type\": \"Field\"},\n {\"name\": \"electronic arts\", \"type\": \"Field\"},\n {\"name\": \"Homeworks\", \"type\": \"Assignment\"},\n {\"name\": \"Mid Semester Project and Presentation\", \"type\": \"Assignment\"},\n {\"name\": \"Final Project\", \"type\": \"Assignment\"},\n {\"name\": \"Participation\", \"type\": \"Grading Component\"},\n {\"name\": \"2D Pixel Processing Program\", \"type\": \"Assignment\"},\n {\"name\": 
\"Printed Layout\", \"type\": \"Assignment\"},\n {\"name\": \"Boolean Quiz\", \"type\": \"Assignment\"},\n {\"name\": \"Simple UI Looping\", \"type\": \"Assignment\"},\n {\"name\": \"2D Attractor System\", \"type\": \"Assignment\"},\n {\"name\": \"Sun, Moon, and Earth\", \"type\": \"Assignment\"},\n {\"name\": \"Functions Distance Function\", \"type\": \"Assignment\"},\n {\"name\": \"5 Images\", \"type\": \"Assignment\"},\n {\"name\": \"Image Project\", \"type\": \"Assignment\"},\n {\"name\": \"3D Grid\", \"type\": \"Assignment\"},\n {\"name\": \"3D Attractor System\", \"type\": \"Assignment\"},\n {\"name\": \"Curve Diagrid\", \"type\": \"Assignment\"},\n {\"name\": \"Laser Tests\", \"type\": \"Assignment\"},\n {\"name\": \"Laser Folding Patterns\", \"type\": \"Assignment\"},\n {\"name\": \"3D Image Patterns\", \"type\": \"Assignment\"},\n {\"name\": \"Vase Definitions\", \"type\": \"Assignment\"},\n {\"name\": \"Surface Skinning\", \"type\": \"Assignment\"},\n {\"name\": \"Voronoi and Mesh Subdivision\", \"type\": \"Assignment\"},\n {\"name\": \"Fall 2020\", \"type\": \"Semester\"},\n {\"name\": \"SUNY New Paltz\", \"type\": \"Institution\"},\n {\"name\": \"Disability Resource Center\", \"type\": \"Institution\"},\n {\"name\": \"Deanna Knapp\", \"type\": \"Person\"},\n {\"name\": \"Jean Vizvary\", \"type\": \"Person\"},\n {\"name\": \"Office of Veteran and Military Services\", \"type\": \"Institution\"},\n {\"name\": \"Title IX Office\", \"type\": \"Institution\"},\n {\"name\": \"Office of Human Resources, Diversity & Inclusion\", \"type\": \"Institution\"},\n {\"name\": \"Sojourner Truth Library\", \"type\": \"Institution\"},\n {\"name\": \"Student Evaluation of Instruction\", \"type\": \"Assessment\"},\n {\"name\": \"syntax errors\", \"type\": \"Programming Concept\"},\n {\"name\": \"logic errors\", \"type\": \"Programming Concept\"},\n {\"name\": \"expressions\", \"type\": \"Programming Concept\"},\n {\"name\": \"control structure\", \"type\": \"Programming 
Concept\"},\n {\"name\": \"algorithms\", \"type\": \"Programming Concept\"},\n {\"name\": \"generative works\", \"type\": \"Art Form\"},\n {\"name\": \"interactive works\", \"type\": \"Art Form\"},\n {\"name\": \"3D design\", \"type\": \"Field\"},\n {\"name\": \"Parametric Representations of Curves\", \"type\": \"Topic\"},\n {\"name\": \"Parametric Representation of Surfaces\", \"type\": \"Topic\"},\n {\"name\": \"Curvature and Developable Surfa",
"error": null
},
"_type": "syllabus"
},
{
"source": "Annual Report - 2020.pdf",
"size_bucket": "large",
"doc_chars_original": 10110,
"doc_chars_sent": 10110,
"truncated": false,
"condition_a": {
"input_tokens": 2725,
"output_tokens": 2405,
"latency_s": 14.14,
"metrics": {
"n_entities": 44,
"n_edges": 54,
"predicate_diversity": 25,
"type_diversity": 23,
"avg_degree": 2.45,
"largest_component": 43,
"largest_component_pct": 97.7
},
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Aaron Nelson\", \"type\": \"Faculty Member\"},\n {\"name\": \"Assistant Professor\", \"type\": \"Academic Rank\"},\n {\"name\": \"Fine and Performing Arts\", \"type\": \"School/College\"},\n {\"name\": \"Art\", \"type\": \"Department\"},\n {\"name\": \"Making Things Move\", \"type\": \"Course\"},\n {\"name\": \"DDF 310-01\", \"type\": \"Course Code\"},\n {\"name\": \"Design Intents\", \"type\": \"Course\"},\n {\"name\": \"DDF 320-01\", \"type\": \"Course Code\"},\n {\"name\": \"Introduction to Computational Design\", \"type\": \"Course\"},\n {\"name\": \"DDF 220-01\", \"type\": \"Course Code\"},\n {\"name\": \"Material Studies\", \"type\": \"Course\"},\n {\"name\": \"DDF 305-01\", \"type\": \"Course Code\"},\n {\"name\": \"COVID-19 pandemic\", \"type\": \"Event\"},\n {\"name\": \"DDF design curriculum\", \"type\": \"Curriculum\"},\n {\"name\": \"Spring 2020\", \"type\": \"Term\"},\n {\"name\": \"Fall 2020\", \"type\": \"Term\"},\n {\"name\": \"Eisgruber\", \"type\": \"Student\"},\n {\"name\": \"Rabadi\", \"type\": \"Student\"},\n {\"name\": \"Coffin\", \"type\": \"Student\"},\n {\"name\": \"Henix\", \"type\": \"Student\"},\n {\"name\": \"Raber\", \"type\": \"Student\"},\n {\"name\": \"Wong\", \"type\": \"Student\"},\n {\"name\": \"Open Source Powder Printer\", \"type\": \"Student Project\"},\n {\"name\": \"Reale\", \"type\": \"Student\"},\n {\"name\": \"Scribani\", \"type\": \"Student\"},\n {\"name\": \"Rubin\", \"type\": \"Student\"},\n {\"name\": \"Orlando\", \"type\": \"Student\"},\n {\"name\": \"3D Print Dispenser\", \"type\": \"Student Project\"},\n {\"name\": \"College Art Association 2020 Annual Conference\", \"type\": \"Conference\"},\n {\"name\": \"Digital and Haptic: Merging New and Old Technology\", \"type\": \"Presentation\"},\n {\"name\": \"NoVo Foundation\", \"type\": \"Granting Agency\"},\n {\"name\": \"Scoping the Makerspace at the Metro Center\", \"type\": \"Grant Project\"},\n 
{\"name\": \"IBM\", \"type\": \"Granting Agency\"},\n {\"name\": \"Applications of Injection Molding Using Polymer Molds and Designing for Additive Manufacturing\", \"type\": \"Grant Project\"},\n {\"name\": \"Phase 2 of IBM/New Paltz Additive Manufacturing Collaboration\", \"type\": \"Grant Project\"},\n {\"name\": \"Art Departmental Committee\", \"type\": \"Committee\"},\n {\"name\": \"HVAMC\", \"type\": \"Organization\"},\n {\"name\": \"Face shields design\", \"type\": \"Community Project\"},\n {\"name\": \"College Art Association\", \"type\": \"Professional Membership\"},\n {\"name\": \"Additive Manufacturing and COVID 19\", \"type\": \"Media Appearance\"},\n {\"name\": \"Academic Minute WAMC NPR\", \"type\": \"Media Outlet\"},\n {\"name\": \"DDF Graduate program\", \"type\": \"Academic Program\"},\n {\"name\": \"Tenure\", \"type\": \"Academic Status\"},\n {\"name\": \"Spring 2021\", \"type\": \"Term\"}\n ],\n \"edges\": [\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"holds rank of\", \"object\": \"Assistant Professor\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"works in\", \"object\": \"Fine and Performing Arts\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"works in\", \"object\": \"Art\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"taught\", \"object\": \"Making Things Move\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"taught\", \"object\": \"Design Intents\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"taught\", \"object\": \"Introduction to Computational Design\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"taught\", \"object\": \"Material Studies\"},\n {\"subject\": \"Making Things Move\", \"predicate\": \"has course code\", \"object\": \"DDF 310-01\"},\n {\"subject\": \"Design Intents\", \"predicate\": \"has course code\", \"object\": \"DDF 320-01\"},\n {\"subject\": \"Introduction to Computational Design\", \"predicate\": \"has course code\", \"object\": \"DDF 220-01\"},\n {\"subject\": \"Material 
Studies\", \"predicate\": \"has course code\", \"object\": \"DDF 305-01\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"led curriculum transition in response to\", \"object\": \"COVID-19 pandemic\"},\n {\"subject\": \"DDF design curriculum\", \"predicate\": \"moved to digital delivery format in\", \"object\": \"Spring 2020\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"redesigned course for online delivery\", \"object",
"error": null
},
"condition_b": {
"local_latency_s": 117.65,
"local_metadata": {
"language": "en",
"char_length": 1347,
"primary_format": "prose",
"structural_signals": {
"has_headings": true,
"has_bullet_lists": false,
"has_numbered_lists": false,
"has_tables": false,
"has_code_blocks": false,
"has_dates": true
},
"content_signals": {
"has_named_people": true,
"has_institutional_language": true,
"has_technical_terminology": true,
"has_first_person": true,
"has_quotations": false
},
"domain_class": "educational",
"one_sentence_summary": "A faculty annual report detailing Aaron Nelson's academic and creative activities during the reporting period of January 1, 2020 \u2013 December 31, 2020."
},
"local_raw": "{\n \"language\": \"en\",\n \"char_length\": 1347,\n \"primary_format\": \"prose\",\n \"structural_signals\": {\n \"has_headings\": true,\n \"has_bullet_lists\": false,\n \"has_numbered_lists\": false,\n \"has_tables\": false,\n \"has_code_blocks\": false,\n \"has_dates\": true\n },\n \"content_signals\": {\n \"has_named_people\": true,\n \"has_institutional_language\": true,\n \"has_technical_terminology\": true,\n \"has_first_person\": true,\n \"has_quotations\": false\n },\n \"domain_class\": \"educational\",\n \"one_sentence_summary\": \"A faculty annual report detailing Aaron Nelson's academic and creative activities during the reporting period of January 1, 2020 \u2013 December 31, 2020.\"\n}",
"api_input_tokens": 3004,
"api_output_tokens": 3093,
"api_latency_s": 14.47,
"metrics": {
"n_entities": 58,
"n_edges": 66,
"predicate_diversity": 35,
"type_diversity": 23,
"avg_degree": 2.28,
"largest_component": 43,
"largest_component_pct": 74.1
},
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Aaron Nelson\", \"type\": \"Person\"},\n {\"name\": \"Assistant Professor\", \"type\": \"Academic Rank\"},\n {\"name\": \"Fine and Performing Arts\", \"type\": \"School/College\"},\n {\"name\": \"Art\", \"type\": \"Department\"},\n {\"name\": \"Making Things Move\", \"type\": \"Course\"},\n {\"name\": \"DDF 310-01\", \"type\": \"Course Code\"},\n {\"name\": \"Design Intents\", \"type\": \"Course\"},\n {\"name\": \"DDF 320-01\", \"type\": \"Course Code\"},\n {\"name\": \"Introduction to Computational Design\", \"type\": \"Course\"},\n {\"name\": \"DDF 220-01\", \"type\": \"Course Code\"},\n {\"name\": \"Material Studies\", \"type\": \"Course\"},\n {\"name\": \"DDF305-01\", \"type\": \"Course Code\"},\n {\"name\": \"Spring 2020\", \"type\": \"Time Period\"},\n {\"name\": \"Summer 2020\", \"type\": \"Time Period\"},\n {\"name\": \"Fall 2020\", \"type\": \"Time Period\"},\n {\"name\": \"Winter 2020-21\", \"type\": \"Time Period\"},\n {\"name\": \"COVID-19 pandemic\", \"type\": \"Event\"},\n {\"name\": \"DDF design curriculum\", \"type\": \"Curriculum\"},\n {\"name\": \"Eisgruber\", \"type\": \"Person\"},\n {\"name\": \"Rabadi\", \"type\": \"Person\"},\n {\"name\": \"Coffin\", \"type\": \"Person\"},\n {\"name\": \"Henix\", \"type\": \"Person\"},\n {\"name\": \"Raber\", \"type\": \"Person\"},\n {\"name\": \"Wong\", \"type\": \"Person\"},\n {\"name\": \"Open Source Powder Printer\", \"type\": \"Student Work\"},\n {\"name\": \"Reale\", \"type\": \"Person\"},\n {\"name\": \"Scribani\", \"type\": \"Person\"},\n {\"name\": \"Rubin\", \"type\": \"Person\"},\n {\"name\": \"Orlando\", \"type\": \"Person\"},\n {\"name\": \"3D Print Dispenser\", \"type\": \"Student Work\"},\n {\"name\": \"College Art Association 2020 Annual Conference\", \"type\": \"Conference\"},\n {\"name\": \"Digital and Haptic: Merging New and Old Technology\", \"type\": \"Presentation\"},\n {\"name\": \"NoVo Foundation\", \"type\": \"Granting 
Agency\"},\n {\"name\": \"Scoping the Makerspace at the Metro Center\", \"type\": \"Grant Project\"},\n {\"name\": \"IBM\", \"type\": \"Granting Agency\"},\n {\"name\": \"Applications of Injection Molding Using Polymer Molds and Designing for Additive Manufacturing\", \"type\": \"Grant Project\"},\n {\"name\": \"Phase 2 of IBM/New Paltz Additive Manufacturing Collaboration\", \"type\": \"Grant Project\"},\n {\"name\": \"Art Departmental Committee\", \"type\": \"Committee\"},\n {\"name\": \"Diversity, Equity, and Inclusion Art Departmental Committee\", \"type\": \"Committee\"},\n {\"name\": \"HVAMC\", \"type\": \"Organization\"},\n {\"name\": \"face shields\", \"type\": \"Product\"},\n {\"name\": \"College Art Association\", \"type\": \"Professional Organization\"},\n {\"name\": \"Academic Minute WAMC NPR\", \"type\": \"Media Outlet\"},\n {\"name\": \"Additive Manufacturing and COVID 19\", \"type\": \"Media Appearance\"},\n {\"name\": \"DDF Graduate program\", \"type\": \"Academic Program\"},\n {\"name\": \"January 1, 2020\", \"type\": \"Date\"},\n {\"name\": \"December 31, 2020\", \"type\": \"Date\"},\n {\"name\": \"September 1, 2020\", \"type\": \"Date\"},\n {\"name\": \"January 31, 2021\", \"type\": \"Date\"},\n {\"name\": \"February 1, 2021\", \"type\": \"Date\"},\n {\"name\": \"August 31, 2021\", \"type\": \"Date\"},\n {\"name\": \"August 15, 2020\", \"type\": \"Date\"},\n {\"name\": \"March 14, 2021\", \"type\": \"Date\"},\n {\"name\": \"October 19, 2020\", \"type\": \"Date\"},\n {\"name\": \"April 18, 2021\", \"type\": \"Date\"},\n {\"name\": \"Fall 2022\", \"type\": \"Time Period\"},\n {\"name\": \"Spring\", \"type\": \"Time Period\"},\n {\"name\": \"tenure\", \"type\": \"Academic Status\"}\n ],\n \"edges\": [\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"holds rank of\", \"object\": \"Assistant Professor\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"works in\", \"object\": \"Fine and Performing Arts\"},\n {\"subject\": \"Aaron Nelson\", 
\"predicate\": \"works in\", \"object\": \"Art\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"taught\", \"object\": \"Making Things Move\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"taught\", \"object\": \"Design Intents\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"taught\", \"object\": \"Introduction to Computational Design\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"taught\", \"object\": \"Material Studies\"},",
"error": null
},
"_type": "faculty_report"
},
{
"source": "Claude: Preparing for dinner with Jim Agutter",
"size_bucket": "large",
"doc_chars_original": 10053,
"doc_chars_sent": 10053,
"truncated": false,
"condition_a": {
"input_tokens": 2269,
"output_tokens": 2516,
"latency_s": 14.25,
"metrics": {
"n_entities": 58,
"n_edges": 51,
"predicate_diversity": 28,
"type_diversity": 24,
"avg_degree": 1.76,
"largest_component": 36,
"largest_component_pct": 62.1
},
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Jim Agutter\", \"type\": \"Person\"},\n {\"name\": \"Division of Multi-Disciplinary Design (MDD)\", \"type\": \"Academic Program\"},\n {\"name\": \"University of Utah\", \"type\": \"Institution\"},\n {\"name\": \"HVAMC\", \"type\": \"Laboratory/Facility\"},\n {\"name\": \"Information visualization\", \"type\": \"Research Field\"},\n {\"name\": \"Design\", \"type\": \"Research Field\"},\n {\"name\": \"Healthcare data visualization\", \"type\": \"Research Field\"},\n {\"name\": \"Complex systems display\", \"type\": \"Research Field\"},\n {\"name\": \"Computational design\", \"type\": \"Research Field\"},\n {\"name\": \"Innovation Lab\", \"type\": \"Laboratory\"},\n {\"name\": \"College of Architecture + Planning\", \"type\": \"Academic Department\"},\n {\"name\": \"Spark Design Initiative\", \"type\": \"Program\"},\n {\"name\": \"Bachelor of University Studies program\", \"type\": \"Academic Program\"},\n {\"name\": \"Office of Undergraduate Studies\", \"type\": \"Academic Department\"},\n {\"name\": \"Frank Drews\", \"type\": \"Person\"},\n {\"name\": \"Dave Strayer\", \"type\": \"Person\"},\n {\"name\": \"Applied Medical Visualizations (Medvis)\", \"type\": \"Company\"},\n {\"name\": \"GE Healthcare\", \"type\": \"Company\"},\n {\"name\": \"ICU nurse situation awareness\", \"type\": \"Research Topic\"},\n {\"name\": \"Medication management\", \"type\": \"Research Topic\"},\n {\"name\": \"Arterial blood gas data visualization\", \"type\": \"Research Topic\"},\n {\"name\": \"UGuide\", \"type\": \"Software/Tool\"},\n {\"name\": \"AI Community of Practice\", \"type\": \"Organization\"},\n {\"name\": \"Connect2Health\", \"type\": \"Program\"},\n {\"name\": \"Honors Praxis Lab\", \"type\": \"Laboratory\"},\n {\"name\": \"Hope in a Time of Dying\", \"type\": \"Novel\"},\n {\"name\": \"LGBTQ+ fiction\", \"type\": \"Genre\"},\n {\"name\": \"Creative Achievement Award\", \"type\": \"Award\"},\n {\"name\": 
\"ACSA\", \"type\": \"Organization\"},\n {\"name\": \"University Honors Professorship\", \"type\": \"Award\"},\n {\"name\": \"Early Career Teaching Award\", \"type\": \"Award\"},\n {\"name\": \"Beacon of Excellence Award\", \"type\": \"Award\"},\n {\"name\": \"Distinguished Innovation and Impact Award\", \"type\": \"Award\"},\n {\"name\": \"IBM\", \"type\": \"Company\"},\n {\"name\": \"Braskem\", \"type\": \"Company\"},\n {\"name\": \"Selux\", \"type\": \"Company\"},\n {\"name\": \"Stable Diffusion\", \"type\": \"Technology\"},\n {\"name\": \"Kingston lunette\", \"type\": \"Project\"},\n {\"name\": \"SUNY New Paltz\", \"type\": \"Institution\"},\n {\"name\": \"Grasshopper\", \"type\": \"Software\"},\n {\"name\": \"Python\", \"type\": \"Programming Language\"},\n {\"name\": \"School of Computing\", \"type\": \"Academic Department\"},\n {\"name\": \"School of Medicine\", \"type\": \"Academic Department\"},\n {\"name\": \"College of Engineering\", \"type\": \"Academic Department\"},\n {\"name\": \"FDM\", \"type\": \"Manufacturing Process\"},\n {\"name\": \"PolyJet\", \"type\": \"Manufacturing Process\"},\n {\"name\": \"SLA\", \"type\": \"Manufacturing Process\"},\n {\"name\": \"DMLS\", \"type\": \"Manufacturing Process\"},\n {\"name\": \"CNC\", \"type\": \"Manufacturing Process\"},\n {\"name\": \"Additive manufacturing\", \"type\": \"Manufacturing Field\"},\n {\"name\": \"Heritage preservation\", \"type\": \"Field\"},\n {\"name\": \"Medical device prototyping\", \"type\": \"Research Area\"},\n {\"name\": \"Anatomical models\", \"type\": \"Research Area\"},\n {\"name\": \"Materials science\", \"type\": \"Research Field\"},\n {\"name\": \"Mechanical engineering\", \"type\": \"Research Field\"},\n {\"name\": \"IDSA\", \"type\": \"Organization\"},\n {\"name\": \"Empathic design\", \"type\": \"Design Approach\"},\n {\"name\": \"Chocolate casting project\", \"type\": \"Project\"}\n ],\n \"edges\": [\n {\"subject\": \"Jim Agutter\", \"predicate\": \"is founder of\", 
\"object\": \"Division of Multi-Disciplinary Design (MDD)\"},\n {\"subject\": \"Jim Agutter\", \"predicate\": \"chairs search committee for\", \"object\": \"MDD\"},\n {\"subject\": \"Division of Multi-Disciplinary Design (MDD)\", \"predicate\": \"is located at\", \"object\": \"University of Utah\"},\n {\"subject\": \"Jim Agutter\", \"predicate\": \"holds position of\", \"object\": \"Senior Associate Dean of Faculty Success ",
"error": null
},
"condition_b": {
"local_latency_s": 94.7,
"local_metadata": {
"language": "en",
"char_length": 3206,
"primary_format": "prose",
"structural_signals": {
"has_headings": true,
"has_bullet_lists": false,
"has_numbered_lists": false,
"has_tables": false,
"has_code_blocks": false,
"has_dates": true
},
"content_signals": {
"has_named_people": true,
"has_institutional_language": true,
"has_technical_terminology": true,
"has_first_person": false,
"has_quotations": false
},
"domain_class": "educational",
"one_sentence_summary": "Discussion between two individuals preparing for a dinner meeting, discussing the background and work of Jim Agutter, a university administrator and designer."
},
"local_raw": "{\n \"language\": \"en\",\n \"char_length\": 3206,\n \"primary_format\": \"prose\",\n \"structural_signals\": {\n \"has_headings\": true,\n \"has_bullet_lists\": false,\n \"has_numbered_lists\": false,\n \"has_tables\": false,\n \"has_code_blocks\": false,\n \"has_dates\": true\n },\n \"content_signals\": {\n \"has_named_people\": true,\n \"has_institutional_language\": true,\n \"has_technical_terminology\": true,\n \"has_first_person\": false,\n \"has_quotations\": false\n },\n \"domain_class\": \"educational\",\n \"one_sentence_summary\": \"Discussion between two individuals preparing for a dinner meeting, discussing the background and work of Jim Agutter, a university administrator and designer.\"\n}",
"api_input_tokens": 2540,
"api_output_tokens": 2395,
"api_latency_s": 12.96,
"metrics": {
"n_entities": 41,
"n_edges": 57,
"predicate_diversity": 35,
"type_diversity": 7,
"avg_degree": 2.78,
"largest_component": 26,
"largest_component_pct": 63.4
},
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Jim Agutter\", \"type\": \"person\"},\n {\"name\": \"Division of Multi-Disciplinary Design (MDD)\", \"type\": \"organization\"},\n {\"name\": \"University of Utah\", \"type\": \"organization\"},\n {\"name\": \"Office of Undergraduate Studies\", \"type\": \"organization\"},\n {\"name\": \"Innovation Lab\", \"type\": \"organization\"},\n {\"name\": \"College of Architecture + Planning\", \"type\": \"organization\"},\n {\"name\": \"Spark Design Initiative\", \"type\": \"organization\"},\n {\"name\": \"Bachelor of University Studies program\", \"type\": \"program\"},\n {\"name\": \"Frank Drews\", \"type\": \"person\"},\n {\"name\": \"Dave Strayer\", \"type\": \"person\"},\n {\"name\": \"Applied Medical Visualizations (Medvis)\", \"type\": \"organization\"},\n {\"name\": \"GE Healthcare\", \"type\": \"organization\"},\n {\"name\": \"IBM\", \"type\": \"organization\"},\n {\"name\": \"Braskem\", \"type\": \"organization\"},\n {\"name\": \"Selux\", \"type\": \"organization\"},\n {\"name\": \"UGuide\", \"type\": \"project\"},\n {\"name\": \"AI Community of Practice\", \"type\": \"organization\"},\n {\"name\": \"Connect2Health\", \"type\": \"project\"},\n {\"name\": \"Honors Praxis Lab\", \"type\": \"organization\"},\n {\"name\": \"Hope in a Time of Dying\", \"type\": \"work\"},\n {\"name\": \"ACSA\", \"type\": \"organization\"},\n {\"name\": \"HVAMC\", \"type\": \"organization\"},\n {\"name\": \"SUNY New Paltz\", \"type\": \"organization\"},\n {\"name\": \"Kingston lunette\", \"type\": \"project\"},\n {\"name\": \"Stable Diffusion\", \"type\": \"technology\"},\n {\"name\": \"IDSA\", \"type\": \"organization\"},\n {\"name\": \"School of Computing\", \"type\": \"organization\"},\n {\"name\": \"School of Medicine\", \"type\": \"organization\"},\n {\"name\": \"College of Engineering\", \"type\": \"organization\"},\n {\"name\": \"Grasshopper\", \"type\": \"technology\"},\n {\"name\": \"Python\", \"type\": 
\"technology\"},\n {\"name\": \"FDM\", \"type\": \"technology\"},\n {\"name\": \"PolyJet\", \"type\": \"technology\"},\n {\"name\": \"SLA\", \"type\": \"technology\"},\n {\"name\": \"DMLS\", \"type\": \"technology\"},\n {\"name\": \"CNC\", \"type\": \"technology\"},\n {\"name\": \"Creative Achievement Award\", \"type\": \"award\"},\n {\"name\": \"University Honors Professorship\", \"type\": \"award\"},\n {\"name\": \"Early Career Teaching Award\", \"type\": \"award\"},\n {\"name\": \"Beacon of Excellence Award\", \"type\": \"award\"},\n {\"name\": \"Distinguished Innovation and Impact Award\", \"type\": \"award\"}\n ],\n \"edges\": [\n {\"subject\": \"Jim Agutter\", \"predicate\": \"is founder of\", \"object\": \"Division of Multi-Disciplinary Design (MDD)\"},\n {\"subject\": \"Jim Agutter\", \"predicate\": \"chairs search committee for\", \"object\": \"Division of Multi-Disciplinary Design (MDD)\"},\n {\"subject\": \"Division of Multi-Disciplinary Design (MDD)\", \"predicate\": \"is located at\", \"object\": \"University of Utah\"},\n {\"subject\": \"Jim Agutter\", \"predicate\": \"holds title of\", \"object\": \"Senior Associate Dean of Faculty Success & Academic Innovation\"},\n {\"subject\": \"Jim Agutter\", \"predicate\": \"works in\", \"object\": \"Office of Undergraduate Studies\"},\n {\"subject\": \"Jim Agutter\", \"predicate\": \"is Lab Director of\", \"object\": \"Innovation Lab\"},\n {\"subject\": \"Jim Agutter\", \"predicate\": \"holds appointment in\", \"object\": \"College of Architecture + Planning\"},\n {\"subject\": \"Jim Agutter\", \"predicate\": \"directs\", \"object\": \"Spark Design Initiative\"},\n {\"subject\": \"Jim Agutter\", \"predicate\": \"directs\", \"object\": \"Bachelor of University Studies program\"},\n {\"subject\": \"Jim Agutter\", \"predicate\": \"is past director of\", \"object\": \"Division of Multi-Disciplinary Design (MDD)\"},\n {\"subject\": \"Jim Agutter\", \"predicate\": \"researches\", \"object\": \"medical 
visualization\"},\n {\"subject\": \"Jim Agutter\", \"predicate\": \"researches\", \"object\": \"ICU nurse situation awareness\"},\n {\"subject\": \"Jim Agutter\", \"predicate\": \"researches\", \"object\": \"integrated information displays for medication management\"},\n {\"subject\": \"Jim Agutter\", \"predicate\": \"researches\", \"object\": \"graphical visualization tools for arterial blood gas data\"},\n {\"subject\": \"Jim Agutte",
"error": null
},
"_type": "conversational"
}
]
}
@@ -0,0 +1,482 @@
{
"meta": {
"docs": 10,
"seed": 42,
"forced_sources": [
"2019-2020 Research and Creative Projects Awards Guidelines.FINAL.pdf",
"2021 - DDF 320 - Design Intents Syllabus.pdf"
]
},
"pairs": [
{
"source": "03_Editing Geometry.docx",
"size_bucket": "small",
"doc_chars": 171,
"metrics_a": {
"n_entities": 7,
"n_edges": 8,
"predicate_diversity": 4,
"type_diversity": 5,
"avg_degree": 2.29,
"largest_component": 7,
"largest_component_pct": 100.0
},
"metrics_b": {
"n_entities": 8,
"n_edges": 10,
"predicate_diversity": 6,
"type_diversity": 4,
"avg_degree": 2.5,
"largest_component": 8,
"largest_component_pct": 100.0
},
"extraction_a": "```json\n{\n \"entities\": [\n {\"name\": \"Editing Commands\", \"type\": \"Feature\"},\n {\"name\": \"geometry\", \"type\": \"Object\"},\n {\"name\": \"sections\", \"type\": \"Component\"},\n {\"name\": \"form\", \"type\": \"Structure\"},\n {\"name\": \"adding\", \"type\": \"Operation\"},\n {\"name\": \"combining\", \"type\": \"Operation\"},\n {\"name\": \"removing\", \"type\": \"Operation\"}\n ],\n \"edges\": [\n {\"subject\": \"Editing Commands\", \"predicate\": \"can edit\", \"object\": \"geometry\"},\n {\"subject\": \"geometry\", \"predicate\": \"can be further edited by\", \"object\": \"adding\"},\n {\"subject\": \"geometry\", \"predicate\": \"can be further edited by\", \"object\": \"combining\"},\n {\"subject\": \"geometry\", \"predicate\": \"can be further edited by\", \"object\": \"removing\"},\n {\"subject\": \"adding\", \"predicate\": \"modifies\", \"object\": \"sections\"},\n {\"subject\": \"combining\", \"predicate\": \"modifies\", \"object\": \"sections\"},\n {\"subject\": \"removing\", \"predicate\": \"modifies\", \"object\": \"sections\"},\n {\"subject\": \"sections\", \"predicate\": \"are part of\", \"object\": \"form\"}\n ]\n}\n```",
"extraction_b": "```json\n{\n \"entities\": [\n {\"name\": \"Editing Commands\", \"type\": \"concept\"},\n {\"name\": \"geometry\", \"type\": \"technical_object\"},\n {\"name\": \"form\", \"type\": \"technical_object\"},\n {\"name\": \"sections\", \"type\": \"technical_component\"},\n {\"name\": \"adding\", \"type\": \"operation\"},\n {\"name\": \"combining\", \"type\": \"operation\"},\n {\"name\": \"removing\", \"type\": \"operation\"},\n {\"name\": \"methods\", \"type\": \"concept\"}\n ],\n \"edges\": [\n {\"subject\": \"geometry\", \"predicate\": \"can be edited through\", \"object\": \"Editing Commands\"},\n {\"subject\": \"geometry\", \"predicate\": \"is part of\", \"object\": \"form\"},\n {\"subject\": \"adding\", \"predicate\": \"is a method for editing\", \"object\": \"geometry\"},\n {\"subject\": \"combining\", \"predicate\": \"is a method for editing\", \"object\": \"geometry\"},\n {\"subject\": \"removing\", \"predicate\": \"is a method for editing\", \"object\": \"geometry\"},\n {\"subject\": \"adding\", \"predicate\": \"operates on\", \"object\": \"sections\"},\n {\"subject\": \"combining\", \"predicate\": \"operates on\", \"object\": \"sections\"},\n {\"subject\": \"removing\", \"predicate\": \"operates on\", \"object\": \"sections\"},\n {\"subject\": \"sections\", \"predicate\": \"are part of\", \"object\": \"form\"},\n {\"subject\": \"methods\", \"predicate\": \"enable editing of\", \"object\": \"geometry\"}\n ]\n}\n```",
"metadata_b": {
"language": "en",
"char_length": 208,
"primary_format": "prose",
"structural_signals": {
"has_headings": false,
"has_bullet_lists": false,
"has_numbered_lists": false,
"has_tables": false,
"has_code_blocks": false,
"has_dates": false
},
"content_signals": {
"has_named_people": false,
"has_institutional_language": false,
"has_technical_terminology": true,
"has_first_person": false,
"has_quotations": false
},
"domain_class": "technical",
"one_sentence_summary": "This document outlines methods for editing geometry in a form."
}
},
{
"source": "10_Moving Parts.docx",
"size_bucket": "medium",
"doc_chars": 2847,
"metrics_a": {
"n_entities": 27,
"n_edges": 23,
"predicate_diversity": 14,
"type_diversity": 8,
"avg_degree": 1.7,
"largest_component": 11,
"largest_component_pct": 40.7
},
"metrics_b": {
"n_entities": 28,
"n_edges": 32,
"predicate_diversity": 13,
"type_diversity": 10,
"avg_degree": 2.29,
"largest_component": 21,
"largest_component_pct": 75.0
},
"extraction_a": "```json\n{\n \"entities\": [\n {\"name\": \"Hinge\", \"type\": \"Mechanism\"},\n {\"name\": \"Horizontal Hinge\", \"type\": \"Mechanism\"},\n {\"name\": \"Pin\", \"type\": \"Component\"},\n {\"name\": \"Knuckles\", \"type\": \"Component\"},\n {\"name\": \"Central axis\", \"type\": \"Component\"},\n {\"name\": \"Overhangs\", \"type\": \"Feature\"},\n {\"name\": \"Vertical Pin/Hinge\", \"type\": \"Mechanism\"},\n {\"name\": \"PLA\", \"type\": \"Material\"},\n {\"name\": \"Build platform\", \"type\": \"Equipment\"},\n {\"name\": \"Ball in Socket\", \"type\": \"Mechanism\"},\n {\"name\": \"Ball\", \"type\": \"Component\"},\n {\"name\": \"Socket\", \"type\": \"Component\"},\n {\"name\": \"Bridges\", \"type\": \"Feature\"},\n {\"name\": \"Gyroscopic effect\", \"type\": \"Property\"},\n {\"name\": \"Screw\", \"type\": \"Mechanism\"},\n {\"name\": \"Ridge\", \"type\": \"Component\"},\n {\"name\": \"Groove\", \"type\": \"Component\"},\n {\"name\": \"Helix\", \"type\": \"Shape\"},\n {\"name\": \"Chain links\", \"type\": \"Mechanism\"},\n {\"name\": \"Fabric\", \"type\": \"Mechanism\"},\n {\"name\": \"Tessellation\", \"type\": \"Property\"},\n {\"name\": \"Mesh\", \"type\": \"Structure\"},\n {\"name\": \"Nesting\", \"type\": \"Mechanism\"},\n {\"name\": \"Telescoping\", \"type\": \"Mechanism\"},\n {\"name\": \"Straight Telescope\", \"type\": \"Mechanism\"},\n {\"name\": \"Wedging Telescope\", \"type\": \"Mechanism\"},\n {\"name\": \"Guide pin\", \"type\": \"Component\"}\n ],\n \"edges\": [\n {\"subject\": \"Hinge\", \"predicate\": \"allows for\", \"object\": \"Rotation\"},\n {\"subject\": \"Hinge\", \"predicate\": \"comprises\", \"object\": \"Pin\"},\n {\"subject\": \"Hinge\", \"predicate\": \"comprises\", \"object\": \"Central axis\"},\n {\"subject\": \"Hinge\", \"predicate\": \"comprises\", \"object\": \"Knuckles\"},\n {\"subject\": \"Horizontal Hinge\", \"predicate\": \"is a type of\", \"object\": \"Hinge\"},\n {\"subject\": \"Horizontal Hinge\", 
\"predicate\": \"takes advantage of\", \"object\": \"Overhangs\"},\n {\"subject\": \"Vertical Pin/Hinge\", \"predicate\": \"can be printed in\", \"object\": \"PLA\"},\n {\"subject\": \"Vertical Pin/Hinge\", \"predicate\": \"can be printed without\", \"object\": \"Supports\"},\n {\"subject\": \"Ball in Socket\", \"predicate\": \"takes advantage of\", \"object\": \"Overhangs\"},\n {\"subject\": \"Ball in Socket\", \"predicate\": \"takes advantage of\", \"object\": \"Bridges\"},\n {\"subject\": \"Ball in Socket\", \"predicate\": \"comprises\", \"object\": \"Ball\"},\n {\"subject\": \"Ball in Socket\", \"predicate\": \"comprises\", \"object\": \"Socket\"},\n {\"subject\": \"Ball\", \"predicate\": \"rolls inside\", \"object\": \"Socket\"},\n {\"subject\": \"Ball in Socket\", \"predicate\": \"can produce\", \"object\": \"Gyroscopic effect\"},\n {\"subject\": \"Screw\", \"predicate\": \"comprises\", \"object\": \"Ridge\"},\n {\"subject\": \"Screw\", \"predicate\": \"comprises\", \"object\": \"Groove\"},\n {\"subject\": \"Ridge\", \"predicate\": \"fits in\", \"object\": \"Groove\"},\n {\"subject\": \"Groove\", \"predicate\": \"travels helically along\", \"object\": \"Axis\"},\n {\"subject\": \"Chain links\", \"predicate\": \"uses\", \"object\": \"Tessellation\"},\n {\"subject\": \"Fabric\", \"predicate\": \"uses\", \"object\": \"Tessellation\"},\n {\"subject\": \"Chain links\", \"predicate\": \"can form\", \"object\": \"Mesh\"},\n {\"subject\": \"Wedging Telescope\", \"predicate\": \"removes need for\", \"object\": \"Guide pin\"},\n {\"subject\": \"Wedging Telescope\", \"predicate\": \"wedges itself in\", \"object\": \"Open position\"}\n ]\n}\n```",
"extraction_b": "```json\n{\n \"entities\": [\n {\"name\": \"Hinge\", \"type\": \"Mechanical Component\"},\n {\"name\": \"Horizontal Hinge\", \"type\": \"Mechanical Component\"},\n {\"name\": \"Vertical Pin/Hinge\", \"type\": \"Mechanical Component\"},\n {\"name\": \"Pin\", \"type\": \"Component Part\"},\n {\"name\": \"Central Axis\", \"type\": \"Component Part\"},\n {\"name\": \"Knuckles\", \"type\": \"Component Part\"},\n {\"name\": \"Overhangs\", \"type\": \"3D Printing Feature\"},\n {\"name\": \"PLA\", \"type\": \"Material\"},\n {\"name\": \"Ball in Socket\", \"type\": \"Mechanical Component\"},\n {\"name\": \"Ball\", \"type\": \"Component Part\"},\n {\"name\": \"Socket\", \"type\": \"Component Part\"},\n {\"name\": \"Bridges\", \"type\": \"3D Printing Feature\"},\n {\"name\": \"Gyroscopic Effect\", \"type\": \"Mechanical Property\"},\n {\"name\": \"Screw\", \"type\": \"Mechanical Component\"},\n {\"name\": \"Ridge\", \"type\": \"Component Part\"},\n {\"name\": \"Groove\", \"type\": \"Component Part\"},\n {\"name\": \"Helix\", \"type\": \"Geometric Pattern\"},\n {\"name\": \"Chain Links\", \"type\": \"Mechanical Component\"},\n {\"name\": \"Fabric\", \"type\": \"Mechanical Component\"},\n {\"name\": \"Tessellation\", \"type\": \"Geometric Pattern\"},\n {\"name\": \"Mesh\", \"type\": \"Structure\"},\n {\"name\": \"Build Plate\", \"type\": \"3D Printing Equipment\"},\n {\"name\": \"Nesting\", \"type\": \"Assembly Method\"},\n {\"name\": \"Telescoping\", \"type\": \"Assembly Method\"},\n {\"name\": \"Straight Telescope\", \"type\": \"Mechanical Component\"},\n {\"name\": \"Wedging Telescope\", \"type\": \"Mechanical Component\"},\n {\"name\": \"Guide Pin\", \"type\": \"Component Part\"},\n {\"name\": \"3D Printing\", \"type\": \"Manufacturing Process\"}\n ],\n \"edges\": [\n {\"subject\": \"Hinge\", \"predicate\": \"allows for\", \"object\": \"Rotation\"},\n {\"subject\": \"Hinge\", \"predicate\": \"comprises\", \"object\": \"Pin\"},\n {\"subject\": \"Hinge\", 
\"predicate\": \"comprises\", \"object\": \"Central Axis\"},\n {\"subject\": \"Hinge\", \"predicate\": \"comprises\", \"object\": \"Knuckles\"},\n {\"subject\": \"Horizontal Hinge\", \"predicate\": \"is a type of\", \"object\": \"Hinge\"},\n {\"subject\": \"Horizontal Hinge\", \"predicate\": \"uses\", \"object\": \"Overhangs\"},\n {\"subject\": \"Vertical Pin/Hinge\", \"predicate\": \"is a type of\", \"object\": \"Hinge\"},\n {\"subject\": \"Vertical Pin/Hinge\", \"predicate\": \"can be printed with\", \"object\": \"PLA\"},\n {\"subject\": \"Vertical Pin/Hinge\", \"predicate\": \"takes advantage of\", \"object\": \"Overhangs\"},\n {\"subject\": \"Ball in Socket\", \"predicate\": \"takes advantage of\", \"object\": \"Overhangs\"},\n {\"subject\": \"Ball in Socket\", \"predicate\": \"takes advantage of\", \"object\": \"Bridges\"},\n {\"subject\": \"Ball in Socket\", \"predicate\": \"comprises\", \"object\": \"Ball\"},\n {\"subject\": \"Ball in Socket\", \"predicate\": \"comprises\", \"object\": \"Socket\"},\n {\"subject\": \"Ball\", \"predicate\": \"rolls inside\", \"object\": \"Socket\"},\n {\"subject\": \"Ball in Socket\", \"predicate\": \"can produce\", \"object\": \"Gyroscopic Effect\"},\n {\"subject\": \"Screw\", \"predicate\": \"comprises\", \"object\": \"Ridge\"},\n {\"subject\": \"Screw\", \"predicate\": \"comprises\", \"object\": \"Groove\"},\n {\"subject\": \"Ridge\", \"predicate\": \"travels helically along\", \"object\": \"Axis\"},\n {\"subject\": \"Chain Links\", \"predicate\": \"uses\", \"object\": \"Tessellation\"},\n {\"subject\": \"Fabric\", \"predicate\": \"uses\", \"object\": \"Tessellation\"},\n {\"subject\": \"Chain Links\", \"predicate\": \"can form\", \"object\": \"Mesh\"},\n {\"subject\": \"Telescoping\", \"predicate\": \"is a type of\", \"object\": \"Assembly Method\"},\n {\"subject\": \"Nesting\", \"predicate\": \"is a type of\", \"object\": \"Assembly Method\"},\n {\"subject\": \"Straight Telescope\", \"predicate\": \"is a type of\", 
\"object\": \"Telescoping\"},\n {\"subject\": \"Wedging Telescope\", \"predicate\": \"is a type of\", \"object\": \"Telescoping\"},\n {\"subject\": \"Wedging Telescope\", \"predicate\": \"removes need for\", \"object\": \"Guide Pin\"},\n {\"subject\": \"Wedging Telescope\", \"predicate\": \"wedges itself in\", \"object\": \"Open Position\"},\n {\"subject\": \"Hinge\", \"predicate\": \"can be created with\", \"object\": \"3D Printing\"},\n {\"subject\": \"Ball in Socket\", \"predi",
"metadata_b": {
"language": "en",
"char_length": 450,
"primary_format": "prose",
"structural_signals": {
"has_headings": false,
"has_bullet_lists": false,
"has_numbered_lists": false,
"has_tables": false,
"has_code_blocks": false,
"has_dates": false
},
"content_signals": {
"has_named_people": false,
"has_institutional_language": false,
"has_technical_terminology": true,
"has_first_person": false,
"has_quotations": false
},
"domain_class": "technical",
"one_sentence_summary": "The document describes various methods for creating mechanical components such as hinges, screws, and telescopes using 3D printing."
}
},
{
"source": "02_Point of Curves.docx",
"size_bucket": "medium",
"doc_chars": 1338,
"metrics_a": {
"n_entities": 28,
"n_edges": 33,
"predicate_diversity": 15,
"type_diversity": 10,
"avg_degree": 2.36,
"largest_component": 25,
"largest_component_pct": 89.3
},
"metrics_b": {
"n_entities": 31,
"n_edges": 41,
"predicate_diversity": 21,
"type_diversity": 6,
"avg_degree": 2.65,
"largest_component": 28,
"largest_component_pct": 90.3
},
"extraction_a": "```json\n{\n \"entities\": [\n {\"name\": \"Rhino\", \"type\": \"Software\"},\n {\"name\": \"NURBS\", \"type\": \"Mathematical Representation\"},\n {\"name\": \"Geometry\", \"type\": \"Concept\"},\n {\"name\": \"3-D geometry\", \"type\": \"Geometric Form\"},\n {\"name\": \"2-D line\", \"type\": \"Geometric Shape\"},\n {\"name\": \"circle\", \"type\": \"Geometric Shape\"},\n {\"name\": \"arc\", \"type\": \"Geometric Shape\"},\n {\"name\": \"curve\", \"type\": \"Geometric Shape\"},\n {\"name\": \"3-D organic free-form surface\", \"type\": \"Geometric Shape\"},\n {\"name\": \"solid\", \"type\": \"Geometric Shape\"},\n {\"name\": \"illustration\", \"type\": \"Process\"},\n {\"name\": \"animation\", \"type\": \"Process\"},\n {\"name\": \"manufacturing\", \"type\": \"Process\"},\n {\"name\": \"point\", \"type\": \"Geometric Element\"},\n {\"name\": \"XYZ intersection\", \"type\": \"Coordinate System\"},\n {\"name\": \"2D geometry\", \"type\": \"Geometric Form\"},\n {\"name\": \"3D geometry\", \"type\": \"Geometric Form\"},\n {\"name\": \"line\", \"type\": \"Geometric Shape\"},\n {\"name\": \"polygon\", \"type\": \"Geometric Shape\"},\n {\"name\": \"ellipse\", \"type\": \"Geometric Shape\"},\n {\"name\": \"helix\", \"type\": \"Geometric Shape\"},\n {\"name\": \"spiral\", \"type\": \"Geometric Shape\"},\n {\"name\": \"open curve\", \"type\": \"Curve Type\"},\n {\"name\": \"closed curve\", \"type\": \"Curve Type\"},\n {\"name\": \"planar curve\", \"type\": \"Curve Type\"},\n {\"name\": \"non-planar curve\", \"type\": \"Curve Type\"},\n {\"name\": \"single curve\", \"type\": \"Curve Category\"},\n {\"name\": \"polycurve\", \"type\": \"Curve Category\"}\n ],\n \"edges\": [\n {\"subject\": \"Rhino\", \"predicate\": \"represents all geometry with\", \"object\": \"NURBS\"},\n {\"subject\": \"NURBS\", \"predicate\": \"are mathematical representations of\", \"object\": \"3-D geometry\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", 
\"object\": \"2-D line\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", \"object\": \"circle\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", \"object\": \"arc\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", \"object\": \"curve\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", \"object\": \"3-D organic free-form surface\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", \"object\": \"solid\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"illustration\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"animation\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"manufacturing\"},\n {\"subject\": \"point\", \"predicate\": \"is the most basic element in\", \"object\": \"Rhino\"},\n {\"subject\": \"point\", \"predicate\": \"represents\", \"object\": \"XYZ intersection\"},\n {\"subject\": \"point\", \"predicate\": \"is used to define\", \"object\": \"2D geometry\"},\n {\"subject\": \"point\", \"predicate\": \"is used to define\", \"object\": \"3D geometry\"},\n {\"subject\": \"curve\", \"predicate\": \"is a 2D geometric form in Rhino\", \"object\": \"Geometry\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"line\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"circle\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"arc\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"polygon\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"ellipse\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"helix\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"spiral\"},\n {\"subject\": \"curve\", \"predicate\": \"can be\", \"object\": \"open curve\"},\n {\"subject\": \"curve\", \"predicate\": 
\"can be\", \"object\": \"closed curve\"},\n {\"subject\": \"curve\", \"predicate\": \"can be\", \"object\": \"planar curve\"},\n {\"subject\": \"curve\", \"predicate\": \"can be\", \"object\": \"non-planar curve\"},\n {\"subject\": \"curve\", \"predicate\": \"are building blocks for\", \"object\": \"complex geometries\"},\n {\"subject\": \"2D geometry\", \"predicate\": \"can be described as\", \"object\": \"single curve\"},\n {\"subject\": \"2D geometry\", \"predicate\": \"can be descri",
"extraction_b": "```json\n{\n \"entities\": [\n {\"name\": \"Rhino\", \"type\": \"Software\"},\n {\"name\": \"Non-Uniform Rational B-Splines\", \"type\": \"Mathematical Representation\"},\n {\"name\": \"NURBS\", \"type\": \"Mathematical Representation\"},\n {\"name\": \"3-D geometry\", \"type\": \"Geometric Form\"},\n {\"name\": \"2-D line\", \"type\": \"Geometric Form\"},\n {\"name\": \"circle\", \"type\": \"Geometric Form\"},\n {\"name\": \"arc\", \"type\": \"Geometric Form\"},\n {\"name\": \"curve\", \"type\": \"Geometric Form\"},\n {\"name\": \"3-D organic free-form surface\", \"type\": \"Geometric Form\"},\n {\"name\": \"solid\", \"type\": \"Geometric Form\"},\n {\"name\": \"illustration\", \"type\": \"Application Domain\"},\n {\"name\": \"animation\", \"type\": \"Application Domain\"},\n {\"name\": \"manufacturing\", \"type\": \"Application Domain\"},\n {\"name\": \"point\", \"type\": \"Geometric Element\"},\n {\"name\": \"area\", \"type\": \"Geometric Property\"},\n {\"name\": \"volume\", \"type\": \"Geometric Property\"},\n {\"name\": \"XYZ intersection\", \"type\": \"Geometric Property\"},\n {\"name\": \"2D geometry\", \"type\": \"Geometric Form\"},\n {\"name\": \"3D geometry\", \"type\": \"Geometric Form\"},\n {\"name\": \"line\", \"type\": \"Geometric Form\"},\n {\"name\": \"polygon\", \"type\": \"Geometric Form\"},\n {\"name\": \"ellipse\", \"type\": \"Geometric Form\"},\n {\"name\": \"helix\", \"type\": \"Geometric Form\"},\n {\"name\": \"spiral\", \"type\": \"Geometric Form\"},\n {\"name\": \"open curve\", \"type\": \"Geometric Form\"},\n {\"name\": \"closed curve\", \"type\": \"Geometric Form\"},\n {\"name\": \"planar curve\", \"type\": \"Geometric Form\"},\n {\"name\": \"non-planar curve\", \"type\": \"Geometric Form\"},\n {\"name\": \"single curve\", \"type\": \"Geometric Form\"},\n {\"name\": \"polycurve\", \"type\": \"Geometric Form\"},\n {\"name\": \"end points\", \"type\": \"Geometric Property\"}\n ],\n \"edges\": [\n {\"subject\": \"Rhino\", 
\"predicate\": \"represents all geometry with\", \"object\": \"Non-Uniform Rational B-Splines\"},\n {\"subject\": \"NURBS\", \"predicate\": \"is abbreviation for\", \"object\": \"Non-Uniform Rational B-Splines\"},\n {\"subject\": \"NURBS\", \"predicate\": \"are mathematical representations of\", \"object\": \"3-D geometry\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"2-D line\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"circle\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"arc\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"curve\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"3-D organic free-form surface\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"solid\"},\n {\"subject\": \"NURBS\", \"predicate\": \"have property of\", \"object\": \"flexibility\"},\n {\"subject\": \"NURBS\", \"predicate\": \"have property of\", \"object\": \"accuracy\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"illustration\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"animation\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"manufacturing\"},\n {\"subject\": \"point\", \"predicate\": \"is most basic element in\", \"object\": \"Rhino\"},\n {\"subject\": \"point\", \"predicate\": \"represents\", \"object\": \"XYZ intersection\"},\n {\"subject\": \"point\", \"predicate\": \"does not have\", \"object\": \"area\"},\n {\"subject\": \"point\", \"predicate\": \"does not have\", \"object\": \"volume\"},\n {\"subject\": \"point\", \"predicate\": \"is used as\", \"object\": \"place holder\"},\n {\"subject\": \"point\", \"predicate\": \"is used to define\", \"object\": \"2D geometry\"},\n {\"subject\": \"point\", \"predicate\": \"is used to define\", \"object\": \"3D geometry\"},\n {\"subject\": 
\"curve\", \"predicate\": \"is 2D geometric form in\", \"object\": \"Rhino\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"line\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"arc\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"circle\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"polygon\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"ellipse\"},\n ",
"metadata_b": {
"language": "en",
"char_length": 289,
"primary_format": "prose",
"structural_signals": {
"has_headings": false,
"has_bullet_lists": false,
"has_numbered_lists": false,
"has_tables": false,
"has_code_blocks": false,
"has_dates": false
},
"content_signals": {
"has_named_people": false,
"has_institutional_language": true,
"has_technical_terminology": true,
"has_first_person": false,
"has_quotations": false
},
"domain_class": "technical",
"one_sentence_summary": "The document describes the use of Non-Uniform Rational B-Splines (NURBS) in Rhino for representing and generating various geometric forms."
}
},
{
"source": "02_Point of Curves - AARON.docx",
"size_bucket": "medium",
"doc_chars": 2116,
"metrics_a": {
"n_entities": 43,
"n_edges": 55,
"predicate_diversity": 22,
"type_diversity": 13,
"avg_degree": 2.56,
"largest_component": 29,
"largest_component_pct": 67.4
},
"metrics_b": {
"n_entities": 42,
"n_edges": 58,
"predicate_diversity": 24,
"type_diversity": 14,
"avg_degree": 2.76,
"largest_component": 26,
"largest_component_pct": 61.9
},
"extraction_a": "```json\n{\n \"entities\": [\n {\"name\": \"Rhino\", \"type\": \"Software\"},\n {\"name\": \"NURBS\", \"type\": \"Mathematical Representation\"},\n {\"name\": \"Geometry\", \"type\": \"Concept\"},\n {\"name\": \"3-D geometry\", \"type\": \"Geometric Type\"},\n {\"name\": \"2-D line\", \"type\": \"Geometric Shape\"},\n {\"name\": \"circle\", \"type\": \"Geometric Shape\"},\n {\"name\": \"arc\", \"type\": \"Geometric Shape\"},\n {\"name\": \"curve\", \"type\": \"Geometric Shape\"},\n {\"name\": \"3-D organic free-form surface\", \"type\": \"Geometric Shape\"},\n {\"name\": \"solid\", \"type\": \"Geometric Shape\"},\n {\"name\": \"point\", \"type\": \"Geometric Element\"},\n {\"name\": \"3D Cartesian space\", \"type\": \"Coordinate System\"},\n {\"name\": \"X-axis\", \"type\": \"Coordinate Axis\"},\n {\"name\": \"Y-axis\", \"type\": \"Coordinate Axis\"},\n {\"name\": \"Z-axis\", \"type\": \"Coordinate Axis\"},\n {\"name\": \"2D geometry\", \"type\": \"Geometric Type\"},\n {\"name\": \"3D geometry\", \"type\": \"Geometric Type\"},\n {\"name\": \"line\", \"type\": \"Geometric Shape\"},\n {\"name\": \"polygon\", \"type\": \"Geometric Shape\"},\n {\"name\": \"ellipse\", \"type\": \"Geometric Shape\"},\n {\"name\": \"helix\", \"type\": \"Geometric Shape\"},\n {\"name\": \"spiral\", \"type\": \"Geometric Shape\"},\n {\"name\": \"open curve\", \"type\": \"Curve Type\"},\n {\"name\": \"closed curve\", \"type\": \"Curve Type\"},\n {\"name\": \"planar curve\", \"type\": \"Curve Type\"},\n {\"name\": \"non-planar curve\", \"type\": \"Curve Type\"},\n {\"name\": \"single curve\", \"type\": \"Curve Type\"},\n {\"name\": \"polycurve\", \"type\": \"Curve Type\"},\n {\"name\": \"polyline\", \"type\": \"Geometric Shape\"},\n {\"name\": \"illustration\", \"type\": \"Application\"},\n {\"name\": \"animation\", \"type\": \"Application\"},\n {\"name\": \"manufacturing\", \"type\": \"Application\"},\n {\"name\": \"Top viewport\", \"type\": \"Interface Element\"},\n 
{\"name\": \"small object template\", \"type\": \"Template\"},\n {\"name\": \"Circle layer\", \"type\": \"Layer\"},\n {\"name\": \"Star layer\", \"type\": \"Layer\"},\n {\"name\": \"Rectangle layer\", \"type\": \"Layer\"},\n {\"name\": \"Polyline layer\", \"type\": \"Layer\"},\n {\"name\": \"Curve layer\", \"type\": \"Layer\"},\n {\"name\": \"Shape layer\", \"type\": \"Layer\"},\n {\"name\": \"rounded rectangle\", \"type\": \"Geometric Shape\"},\n {\"name\": \"open polyline\", \"type\": \"Geometric Shape\"},\n {\"name\": \"closed shape\", \"type\": \"Geometric Shape\"}\n ],\n \"edges\": [\n {\"subject\": \"Rhino\", \"predicate\": \"represents all geometry with\", \"object\": \"NURBS\"},\n {\"subject\": \"NURBS\", \"predicate\": \"is a\", \"object\": \"Mathematical Representation\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"2-D line\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"circle\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"arc\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"curve\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"3-D organic free-form surface\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"solid\"},\n {\"subject\": \"NURBS\", \"predicate\": \"have\", \"object\": \"flexibility\"},\n {\"subject\": \"NURBS\", \"predicate\": \"have\", \"object\": \"accuracy\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"illustration\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"animation\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"manufacturing\"},\n {\"subject\": \"point\", \"predicate\": \"is the most basic element in\", \"object\": \"Rhino\"},\n {\"subject\": \"point\", \"predicate\": \"represents a position in\", \"object\": \"3D Cartesian space\"},\n 
{\"subject\": \"point\", \"predicate\": \"is expressed as coordinates along\", \"object\": \"X-axis\"},\n {\"subject\": \"point\", \"predicate\": \"is expressed as coordinates along\", \"object\": \"Y-axis\"},\n {\"subject\": \"point\", \"predicate\": \"is expressed as coordinates along\", \"object\": \"Z-axis\"},\n {\"subject\": \"points\", \"predicate\": \"are building blocks that define\", \"object\": \"2D geometry\"},\n {\"subject\": \"points\", \"predicate\": \"are building block",
"extraction_b": "```json\n{\n \"entities\": [\n {\"name\": \"Geometry\", \"type\": \"concept\"},\n {\"name\": \"Rhino\", \"type\": \"software\"},\n {\"name\": \"Non-Uniform Rational B-Splines\", \"type\": \"mathematical_concept\"},\n {\"name\": \"NURBS\", \"type\": \"mathematical_concept\"},\n {\"name\": \"3-D geometry\", \"type\": \"geometric_concept\"},\n {\"name\": \"2-D line\", \"type\": \"geometric_shape\"},\n {\"name\": \"circle\", \"type\": \"geometric_shape\"},\n {\"name\": \"arc\", \"type\": \"geometric_shape\"},\n {\"name\": \"curve\", \"type\": \"geometric_shape\"},\n {\"name\": \"3-D organic free-form surface\", \"type\": \"geometric_shape\"},\n {\"name\": \"solid\", \"type\": \"geometric_shape\"},\n {\"name\": \"point\", \"type\": \"geometric_element\"},\n {\"name\": \"3D Cartesian space\", \"type\": \"coordinate_system\"},\n {\"name\": \"X-axis\", \"type\": \"coordinate_axis\"},\n {\"name\": \"Y-axis\", \"type\": \"coordinate_axis\"},\n {\"name\": \"Z-axis\", \"type\": \"coordinate_axis\"},\n {\"name\": \"2D geometry\", \"type\": \"geometric_concept\"},\n {\"name\": \"3D geometry\", \"type\": \"geometric_concept\"},\n {\"name\": \"line\", \"type\": \"geometric_shape\"},\n {\"name\": \"polygon\", \"type\": \"geometric_shape\"},\n {\"name\": \"ellipse\", \"type\": \"geometric_shape\"},\n {\"name\": \"helix\", \"type\": \"geometric_shape\"},\n {\"name\": \"spiral\", \"type\": \"geometric_shape\"},\n {\"name\": \"open curve\", \"type\": \"curve_type\"},\n {\"name\": \"closed curve\", \"type\": \"curve_type\"},\n {\"name\": \"planar curve\", \"type\": \"curve_type\"},\n {\"name\": \"non-planar curve\", \"type\": \"curve_type\"},\n {\"name\": \"single curve\", \"type\": \"curve_type\"},\n {\"name\": \"polycurve\", \"type\": \"curve_type\"},\n {\"name\": \"polyline\", \"type\": \"geometric_shape\"},\n {\"name\": \"illustration\", \"type\": \"application_domain\"},\n {\"name\": \"animation\", \"type\": \"application_domain\"},\n {\"name\": \"manufacturing\", 
\"type\": \"application_domain\"},\n {\"name\": \"small object template\", \"type\": \"template\"},\n {\"name\": \"Top viewport\", \"type\": \"viewport\"},\n {\"name\": \"layer\", \"type\": \"organizational_element\"},\n {\"name\": \"Star\", \"type\": \"geometric_shape\"},\n {\"name\": \"Rectangle\", \"type\": \"geometric_shape\"},\n {\"name\": \"rounded corners\", \"type\": \"geometric_feature\"},\n {\"name\": \"Open Poly-line\", \"type\": \"geometric_shape\"},\n {\"name\": \"Closed Curve\", \"type\": \"geometric_shape\"},\n {\"name\": \"Shape\", \"type\": \"geometric_concept\"}\n ],\n \"edges\": [\n {\"subject\": \"Geometry\", \"predicate\": \"is represented with\", \"object\": \"NURBS\"},\n {\"subject\": \"NURBS\", \"predicate\": \"is also known as\", \"object\": \"Non-Uniform Rational B-Splines\"},\n {\"subject\": \"NURBS\", \"predicate\": \"are mathematical representations of\", \"object\": \"3-D geometry\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"2-D line\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"circle\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"arc\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"curve\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"3-D organic free-form surface\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"solid\"},\n {\"subject\": \"NURBS\", \"predicate\": \"have property of\", \"object\": \"flexibility\"},\n {\"subject\": \"NURBS\", \"predicate\": \"have property of\", \"object\": \"accuracy\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"illustration\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"animation\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"manufacturing\"},\n {\"subject\": \"point\", \"predicate\": \"is the most 
basic element in\", \"object\": \"Rhino\"},\n {\"subject\": \"point\", \"predicate\": \"represents\", \"object\": \"position in 3D Cartesian space\"},\n {\"subject\": \"point\", \"predicate\": \"is expressed as\", \"object\": \"coordinates\"},\n {\"subject\": \"coordinates\", \"predicate\": \"are along\", \"object\": \"X-axis\"},\n {\"subject\": \"coordinates\", \"predicate\": \"are along\", \"object\": \"Y-axis\"},\n {\"subject\": \"coordinates\", \"predicate\": \"are",
"metadata_b": {
"language": "en",
"char_length": 320,
"primary_format": "prose",
"structural_signals": {
"has_headings": false,
"has_bullet_lists": false,
"has_numbered_lists": false,
"has_tables": false,
"has_code_blocks": false,
"has_dates": false
},
"content_signals": {
"has_named_people": false,
"has_institutional_language": false,
"has_technical_terminology": true,
"has_first_person": false,
"has_quotations": false
},
"domain_class": "technical",
"one_sentence_summary": "Document describes the basics of Geometry in Rhino software"
}
},
{
"source": "2019-2020 Research and Creative Projects Awards Guidelines.FINAL.pdf",
"size_bucket": "large",
"doc_chars": 12000,
"metrics_a": {
"n_entities": 84,
"n_edges": 129,
"predicate_diversity": 72,
"type_diversity": 20,
"avg_degree": 3.07,
"largest_component": 7,
"largest_component_pct": 8.3
},
"metrics_b": {
"n_entities": 79,
"n_edges": 98,
"predicate_diversity": 55,
"type_diversity": 26,
"avg_degree": 2.48,
"largest_component": 12,
"largest_component_pct": 15.2
},
"extraction_a": "```json\n{\n \"entities\": [\n {\"name\": \"Research and Creative Projects Awards Program\", \"type\": \"Program\"},\n {\"name\": \"Faculty\", \"type\": \"Role\"},\n {\"name\": \"Division of Academic Affairs\", \"type\": \"Organization\"},\n {\"name\": \"Committee on Research, Awards and Leaves (CRAL)\", \"type\": \"Committee\"},\n {\"name\": \"Provost\", \"type\": \"Role\"},\n {\"name\": \"Office of Sponsored Programs\", \"type\": \"Organization\"},\n {\"name\": \"Carrie Corti\", \"type\": \"Person\"},\n {\"name\": \"Department Chair\", \"type\": \"Role\"},\n {\"name\": \"Dean\", \"type\": \"Role\"},\n {\"name\": \"Human Research Ethics Board (HREB)\", \"type\": \"Committee\"},\n {\"name\": \"Office of Academic Affairs\", \"type\": \"Organization\"},\n {\"name\": \"Student\", \"type\": \"Role\"},\n {\"name\": \"Tenure-track teaching faculty\", \"type\": \"Role\"},\n {\"name\": \"Term lecturers\", \"type\": \"Role\"},\n {\"name\": \"Tenured faculty\", \"type\": \"Role\"},\n {\"name\": \"Non-tenured tenure-track teaching faculty\", \"type\": \"Role\"},\n {\"name\": \"Payroll Department\", \"type\": \"Organization\"},\n {\"name\": \"IMS\", \"type\": \"Organization\"},\n {\"name\": \"IT\", \"type\": \"Organization\"},\n {\"name\": \"State of New York\", \"type\": \"Organization\"},\n {\"name\": \"Library\", \"type\": \"Organization\"},\n {\"name\": \"State University of New York at New Paltz\", \"type\": \"Organization\"},\n {\"name\": \"April 1, 2019\", \"type\": \"Date\"},\n {\"name\": \"March 31, 2020\", \"type\": \"Date\"},\n {\"name\": \"November 30, 2018\", \"type\": \"Date\"},\n {\"name\": \"December 21, 2018\", \"type\": \"Date\"},\n {\"name\": \"January 11, 2019\", \"type\": \"Date\"},\n {\"name\": \"January 18, 2019\", \"type\": \"Date\"},\n {\"name\": \"February 15, 2019\", \"type\": \"Date\"},\n {\"name\": \"March 1, 2019\", \"type\": \"Date\"},\n {\"name\": \"$2,500\", \"type\": \"Amount\"},\n {\"name\": \"$11.10 per hour\", \"type\": 
\"Rate\"},\n {\"name\": \"$0.54 per mile\", \"type\": \"Rate\"},\n {\"name\": \"Faculty research\", \"type\": \"Activity\"},\n {\"name\": \"Creative activity\", \"type\": \"Activity\"},\n {\"name\": \"Student learning\", \"type\": \"Concept\"},\n {\"name\": \"Faculty scholarship\", \"type\": \"Concept\"},\n {\"name\": \"New knowledge\", \"type\": \"Concept\"},\n {\"name\": \"Educational experience\", \"type\": \"Concept\"},\n {\"name\": \"External funding\", \"type\": \"Concept\"},\n {\"name\": \"InfoEd profile\", \"type\": \"Tool\"},\n {\"name\": \"Project narrative\", \"type\": \"Document\"},\n {\"name\": \"Cover sheet\", \"type\": \"Document\"},\n {\"name\": \"Budget\", \"type\": \"Document\"},\n {\"name\": \"Vitae\", \"type\": \"Document\"},\n {\"name\": \"Final report\", \"type\": \"Document\"},\n {\"name\": \"Evaluation criteria\", \"type\": \"Concept\"},\n {\"name\": \"Contribution\", \"type\": \"Criterion\"},\n {\"name\": \"Dissemination\", \"type\": \"Criterion\"},\n {\"name\": \"Methodology\", \"type\": \"Criterion\"},\n {\"name\": \"Capability\", \"type\": \"Criterion\"},\n {\"name\": \"Budget justification\", \"type\": \"Criterion\"},\n {\"name\": \"Tenured/permanent status\", \"type\": \"Criterion\"},\n {\"name\": \"Student Assistance\", \"type\": \"Budget Category\"},\n {\"name\": \"Travel\", \"type\": \"Budget Category\"},\n {\"name\": \"Supplies and materials\", \"type\": \"Budget Category\"},\n {\"name\": \"Other\", \"type\": \"Budget Category\"},\n {\"name\": \"Laboratory supplies\", \"type\": \"Item\"},\n {\"name\": \"Books and journals\", \"type\": \"Item\"},\n {\"name\": \"Artist supplies\", \"type\": \"Item\"},\n {\"name\": \"Office supplies\", \"type\": \"Item\"},\n {\"name\": \"Computer supplies\", \"type\": \"Item\"},\n {\"name\": \"Software\", \"type\": \"Item\"},\n {\"name\": \"Equipment\", \"type\": \"Item\"},\n {\"name\": \"Tape recorders\", \"type\": \"Equipment\"},\n {\"name\": \"Cameras\", \"type\": \"Equipment\"},\n {\"name\": 
\"Video recorders\", \"type\": \"Equipment\"},\n {\"name\": \"Duplicating costs\", \"type\": \"Expense\"},\n {\"name\": \"Postage\", \"type\": \"Expense\"},\n {\"name\": \"Publication related costs\", \"type\": \"Expense\"},\n {\"name\": \"Computer data searches\", \"type\": \"Expense\"},\n {\"name\": \"Human subjects\", \"type\": \"Concept\"},\n {\"name\": \"Human subjects training program\", \"type\": \"Program\"},\n {\"name\": \"Conference attendance\", \"type\": \"",
"extraction_b": "```json\n{\n \"entities\": [\n {\"name\": \"Research and Creative Projects Awards Program\", \"type\": \"Program\"},\n {\"name\": \"SUNY New Paltz\", \"type\": \"Institution\"},\n {\"name\": \"Division of Academic Affairs\", \"type\": \"Organizational Unit\"},\n {\"name\": \"Committee on Research, Awards and Leaves (CRAL)\", \"type\": \"Committee\"},\n {\"name\": \"Provost\", \"type\": \"Position\"},\n {\"name\": \"Office of Sponsored Programs\", \"type\": \"Organizational Unit\"},\n {\"name\": \"Carrie Corti\", \"type\": \"Person\"},\n {\"name\": \"cortic@newpaltz.edu\", \"type\": \"Email Address\"},\n {\"name\": \"Department Chair\", \"type\": \"Position\"},\n {\"name\": \"Dean\", \"type\": \"Position\"},\n {\"name\": \"Office of Academic Affairs\", \"type\": \"Organizational Unit\"},\n {\"name\": \"researchcreative@newpaltz.edu\", \"type\": \"Email Address\"},\n {\"name\": \"Human Research Ethics Board (HREB)\", \"type\": \"Committee\"},\n {\"name\": \"InfoEd\", \"type\": \"System\"},\n {\"name\": \"IMS\", \"type\": \"Organizational Unit\"},\n {\"name\": \"IT\", \"type\": \"Organizational Unit\"},\n {\"name\": \"Payroll Department\", \"type\": \"Organizational Unit\"},\n {\"name\": \"Library\", \"type\": \"Organizational Unit\"},\n {\"name\": \"State of New York\", \"type\": \"Government Entity\"},\n {\"name\": \"2019-2020\", \"type\": \"Academic Year\"},\n {\"name\": \"April 1, 2019\", \"type\": \"Date\"},\n {\"name\": \"March 31, 2020\", \"type\": \"Date\"},\n {\"name\": \"November 30, 2018\", \"type\": \"Date\"},\n {\"name\": \"December 21, 2018\", \"type\": \"Date\"},\n {\"name\": \"January 11, 2019\", \"type\": \"Date\"},\n {\"name\": \"January 18, 2019\", \"type\": \"Date\"},\n {\"name\": \"February 15, 2019\", \"type\": \"Date\"},\n {\"name\": \"March 1, 2019\", \"type\": \"Date\"},\n {\"name\": \"Full-time tenure-track teaching faculty\", \"type\": \"Eligible Applicant\"},\n {\"name\": \"Full-time term lecturers\", \"type\": \"Eligible 
Applicant\"},\n {\"name\": \"$2,500\", \"type\": \"Monetary Amount\"},\n {\"name\": \"$11.10 per hour\", \"type\": \"Wage Rate\"},\n {\"name\": \"$0.54 per mile\", \"type\": \"Mileage Rate\"},\n {\"name\": \"x3281\", \"type\": \"Phone Extension\"},\n {\"name\": \"x3282\", \"type\": \"Phone Extension\"},\n {\"name\": \"Project narrative\", \"type\": \"Document Component\"},\n {\"name\": \"Cover sheet\", \"type\": \"Document Component\"},\n {\"name\": \"Budget\", \"type\": \"Document Component\"},\n {\"name\": \"Current Vitae\", \"type\": \"Document Component\"},\n {\"name\": \"Department chair evaluation\", \"type\": \"Document Component\"},\n {\"name\": \"Dean evaluation\", \"type\": \"Document Component\"},\n {\"name\": \"Student Assistant\", \"type\": \"Position\"},\n {\"name\": \"Contribution criterion\", \"type\": \"Evaluation Criterion\"},\n {\"name\": \"Dissemination criterion\", \"type\": \"Evaluation Criterion\"},\n {\"name\": \"Methodology criterion\", \"type\": \"Evaluation Criterion\"},\n {\"name\": \"Capability criterion\", \"type\": \"Evaluation Criterion\"},\n {\"name\": \"Budget criterion\", \"type\": \"Evaluation Criterion\"},\n {\"name\": \"Tenured/permanent criterion\", \"type\": \"Evaluation Criterion\"},\n {\"name\": \"HREB criterion\", \"type\": \"Evaluation Criterion\"},\n {\"name\": \"Previous support criterion\", \"type\": \"Evaluation Criterion\"},\n {\"name\": \"Faculty research\", \"type\": \"Activity\"},\n {\"name\": \"Creative activity\", \"type\": \"Activity\"},\n {\"name\": \"Student learning\", \"type\": \"Outcome\"},\n {\"name\": \"New knowledge\", \"type\": \"Outcome\"},\n {\"name\": \"Educational experience\", \"type\": \"Outcome\"},\n {\"name\": \"Personnel costs\", \"type\": \"Budget Category\"},\n {\"name\": \"Travel costs\", \"type\": \"Budget Category\"},\n {\"name\": \"Supplies and materials\", \"type\": \"Budget Category\"},\n {\"name\": \"Other costs\", \"type\": \"Budget Category\"},\n {\"name\": \"Laboratory supplies\", 
\"type\": \"Eligible Expense\"},\n {\"name\": \"Books and journals\", \"type\": \"Eligible Expense\"},\n {\"name\": \"Artist supplies\", \"type\": \"Eligible Expense\"},\n {\"name\": \"Office supplies\", \"type\": \"Eligible Expense\"},\n {\"name\": \"Computer supplies and software\", \"type\": \"Eligible Expense\"},\n {\"name\": \"Small equipment items\", \"type\": \"Eligible Expense\"},\n {\"name\": \"Duplica",
"metadata_b": {
"language": "en",
"char_length": 1420,
"primary_format": "prose",
"structural_signals": {
"has_headings": true,
"has_bullet_lists": false,
"has_numbered_lists": false,
"has_tables": false,
"has_code_blocks": false,
"has_dates": true
},
"content_signals": {
"has_named_people": true,
"has_institutional_language": true,
"has_technical_terminology": true,
"has_first_person": true,
"has_quotations": false
},
"domain_class": "administrative",
"one_sentence_summary": "Document outlines guidelines for a research and creative projects awards program at SUNY New Paltz"
}
},
{
"source": "2021 - DDF 320 - Design Intents Syllabus.pdf",
"size_bucket": "large",
"doc_chars": 12000,
"metrics_a": {
"n_entities": 57,
"n_edges": 51,
"predicate_diversity": 29,
"type_diversity": 23,
"avg_degree": 1.79,
"largest_component": 18,
"largest_component_pct": 31.6
},
"metrics_b": {
"n_entities": 142,
"n_edges": 175,
"predicate_diversity": 57,
"type_diversity": 33,
"avg_degree": 2.46,
"largest_component": 84,
"largest_component_pct": 59.2
},
"extraction_a": "```json\n{\n \"entities\": [\n {\"name\": \"Design Intents DDF 320-01\", \"type\": \"Course\"},\n {\"name\": \"Aaron Nelson\", \"type\": \"Instructor\"},\n {\"name\": \"Spring 2021\", \"type\": \"Semester\"},\n {\"name\": \"EIH 104\", \"type\": \"Location\"},\n {\"name\": \"HVAMC\", \"type\": \"Building\"},\n {\"name\": \"nelsona@newpaltz.edu\", \"type\": \"Email\"},\n {\"name\": \"human-centered design\", \"type\": \"Design Principle\"},\n {\"name\": \"prototype driven design practice\", \"type\": \"Design Principle\"},\n {\"name\": \"mindfulness of process\", \"type\": \"Design Principle\"},\n {\"name\": \"design processes\", \"type\": \"Topic\"},\n {\"name\": \"innovation methodologies\", \"type\": \"Topic\"},\n {\"name\": \"need finding\", \"type\": \"Topic\"},\n {\"name\": \"human factors\", \"type\": \"Topic\"},\n {\"name\": \"visualization\", \"type\": \"Topic\"},\n {\"name\": \"rapid prototyping\", \"type\": \"Topic\"},\n {\"name\": \"team dynamics\", \"type\": \"Topic\"},\n {\"name\": \"storytelling\", \"type\": \"Topic\"},\n {\"name\": \"project leadership\", \"type\": \"Topic\"},\n {\"name\": \"design research\", \"type\": \"Skill\"},\n {\"name\": \"iterative ideation\", \"type\": \"Skill\"},\n {\"name\": \"presentation\", \"type\": \"Skill\"},\n {\"name\": \"DDF minor\", \"type\": \"Academic Program\"},\n {\"name\": \"Homeworks\", \"type\": \"Assignment\"},\n {\"name\": \"Reading Responses\", \"type\": \"Assignment\"},\n {\"name\": \"Group Presentations\", \"type\": \"Assignment\"},\n {\"name\": \"Quizzes\", \"type\": \"Assignment\"},\n {\"name\": \"Class Discussion Participation\", \"type\": \"Assignment\"},\n {\"name\": \"Midterm Presentation\", \"type\": \"Assignment\"},\n {\"name\": \"In Progress Design Review\", \"type\": \"Assignment\"},\n {\"name\": \"Final Prototype Presentations\", \"type\": \"Assignment\"},\n {\"name\": \"Blackboard\", \"type\": \"Platform\"},\n {\"name\": \"Don Norman\", \"type\": \"Author\"},\n {\"name\": 
\"Design Thinking\", \"type\": \"Concept\"},\n {\"name\": \"Dieter Rams\", \"type\": \"Author\"},\n {\"name\": \"Principals for Good Design\", \"type\": \"Concept\"},\n {\"name\": \"IDEO\", \"type\": \"Organization\"},\n {\"name\": \"10 Step Design Process\", \"type\": \"Framework\"},\n {\"name\": \"SELUX\", \"type\": \"Client\"},\n {\"name\": \"Problem Statement\", \"type\": \"Deliverable\"},\n {\"name\": \"Stakeholder Maps\", \"type\": \"Deliverable\"},\n {\"name\": \"Storyboards\", \"type\": \"Deliverable\"},\n {\"name\": \"Mockups\", \"type\": \"Deliverable\"},\n {\"name\": \"Pitch\", \"type\": \"Deliverable\"},\n {\"name\": \"Disability Resource Center\", \"type\": \"Institution\"},\n {\"name\": \"Student Union\", \"type\": \"Location\"},\n {\"name\": \"Deanna Knapp\", \"type\": \"Person\"},\n {\"name\": \"Jean Vizvary\", \"type\": \"Person\"},\n {\"name\": \"Sojourner Truth Library\", \"type\": \"Institution\"},\n {\"name\": \"Office of Veteran and Military Services\", \"type\": \"Institution\"},\n {\"name\": \"SUNY New Paltz\", \"type\": \"Institution\"},\n {\"name\": \"Student Conduct Office\", \"type\": \"Institution\"},\n {\"name\": \"March 29th\", \"type\": \"Date\"},\n {\"name\": \"April 19-May 3\", \"type\": \"Date Range\"},\n {\"name\": \"May 7\", \"type\": \"Date\"},\n {\"name\": \"academic integrity\", \"type\": \"Policy\"},\n {\"name\": \"attendance policy\", \"type\": \"Policy\"},\n {\"name\": \"public health policy\", \"type\": \"Policy\"}\n ],\n \"edges\": [\n {\"subject\": \"Design Intents DDF 320-01\", \"predicate\": \"is taught by\", \"object\": \"Aaron Nelson\"},\n {\"subject\": \"Design Intents DDF 320-01\", \"predicate\": \"is offered in\", \"object\": \"Spring 2021\"},\n {\"subject\": \"Design Intents DDF 320-01\", \"predicate\": \"is held at\", \"object\": \"EIH 104\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"has office hours\", \"object\": \"M-R 10am \u2013 12pm\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"can be 
contacted at\", \"object\": \"nelsona@newpaltz.edu\"},\n {\"subject\": \"Design Intents DDF 320-01\", \"predicate\": \"introduces\", \"object\": \"collaborative team research\"},\n {\"subject\": \"Design Intents DDF 320-01\", \"predicate\": \"introduces\", \"object\": \"interdisciplinary practices\"},\n {\"subject\": \"Design Intents DDF 320-01\", \"predicate\": \"covers\", \"object\": \"human-centered design\"},\n {\"subject\": \"Design Intents DDF 32",
"extraction_b": "```json\n{\n \"entities\": [\n {\"name\": \"Design Intents\", \"type\": \"course\"},\n {\"name\": \"DDF 320-01\", \"type\": \"course_code\"},\n {\"name\": \"Spring 2021\", \"type\": \"semester\"},\n {\"name\": \"Aaron Nelson\", \"type\": \"person\"},\n {\"name\": \"EIH 104\", \"type\": \"location\"},\n {\"name\": \"HVAMC\", \"type\": \"location\"},\n {\"name\": \"nelsona@newpaltz.edu\", \"type\": \"email\"},\n {\"name\": \"SUNY New Paltz\", \"type\": \"institution\"},\n {\"name\": \"human-centered design\", \"type\": \"design_concept\"},\n {\"name\": \"prototype driven design practice\", \"type\": \"design_concept\"},\n {\"name\": \"fail fast\", \"type\": \"design_concept\"},\n {\"name\": \"mindfulness of process\", \"type\": \"design_concept\"},\n {\"name\": \"design processes\", \"type\": \"topic\"},\n {\"name\": \"innovation methodologies\", \"type\": \"topic\"},\n {\"name\": \"need finding\", \"type\": \"topic\"},\n {\"name\": \"human factors\", \"type\": \"topic\"},\n {\"name\": \"visualization\", \"type\": \"topic\"},\n {\"name\": \"rapid prototyping\", \"type\": \"topic\"},\n {\"name\": \"team dynamics\", \"type\": \"topic\"},\n {\"name\": \"storytelling\", \"type\": \"topic\"},\n {\"name\": \"project leadership\", \"type\": \"topic\"},\n {\"name\": \"design studio\", \"type\": \"course_format\"},\n {\"name\": \"design research\", \"type\": \"skill\"},\n {\"name\": \"iterative ideation\", \"type\": \"skill\"},\n {\"name\": \"presentation\", \"type\": \"skill\"},\n {\"name\": \"DDF minor\", \"type\": \"academic_program\"},\n {\"name\": \"Blackboard\", \"type\": \"platform\"},\n {\"name\": \"Homeworks\", \"type\": \"assignment_type\"},\n {\"name\": \"Reading Responses\", \"type\": \"assignment_type\"},\n {\"name\": \"Group Presentations\", \"type\": \"assignment_type\"},\n {\"name\": \"Quizzes\", \"type\": \"assignment_type\"},\n {\"name\": \"Class Discussion\", \"type\": \"assignment_type\"},\n {\"name\": \"Midterm Presentation\", \"type\": 
\"assignment_type\"},\n {\"name\": \"In Progress Design Review\", \"type\": \"assignment_type\"},\n {\"name\": \"Final Prototype Presentations\", \"type\": \"assignment_type\"},\n {\"name\": \"Selux\", \"type\": \"client\"},\n {\"name\": \"March 29th\", \"type\": \"date\"},\n {\"name\": \"Disability Resource Center\", \"type\": \"institution\"},\n {\"name\": \"Student Union\", \"type\": \"location\"},\n {\"name\": \"Deanna Knapp\", \"type\": \"person\"},\n {\"name\": \"knappd@newpaltz.edu\", \"type\": \"email\"},\n {\"name\": \"Jean Vizvary\", \"type\": \"person\"},\n {\"name\": \"vizvaryj@newpaltz.edu\", \"type\": \"email\"},\n {\"name\": \"Office of Veteran and Military Services\", \"type\": \"institution\"},\n {\"name\": \"OVMS\", \"type\": \"institution\"},\n {\"name\": \"Student Evaluation of Instruction\", \"type\": \"assessment\"},\n {\"name\": \"SEI\", \"type\": \"assessment\"},\n {\"name\": \"April 19-May 3\", \"type\": \"date_range\"},\n {\"name\": \"Don Norman\", \"type\": \"person\"},\n {\"name\": \"Design Thinking\", \"type\": \"topic\"},\n {\"name\": \"Dieter Rams\", \"type\": \"person\"},\n {\"name\": \"Principals for Good Design\", \"type\": \"design_concept\"},\n {\"name\": \"IDEO\", \"type\": \"organization\"},\n {\"name\": \"10 Step Design Process\", \"type\": \"framework\"},\n {\"name\": \"Frameworks for Design\", \"type\": \"topic\"},\n {\"name\": \"Problem Statement\", \"type\": \"deliverable\"},\n {\"name\": \"Brainstorming Sessions\", \"type\": \"activity\"},\n {\"name\": \"Stakeholder Maps\", \"type\": \"deliverable\"},\n {\"name\": \"Concept Generation\", \"type\": \"activity\"},\n {\"name\": \"Storyboards\", \"type\": \"deliverable\"},\n {\"name\": \"Mockups\", \"type\": \"deliverable\"},\n {\"name\": \"Pitch\", \"type\": \"deliverable\"},\n {\"name\": \"Group Assessment 1\", \"type\": \"assessment\"},\n {\"name\": \"Group Assessment 2\", \"type\": \"assessment\"},\n {\"name\": \"Group Assessment 3\", \"type\": \"assessment\"},\n 
{\"name\": \"Sojourner Truth Library\", \"type\": \"institution\"},\n {\"name\": \"Student Conduct Office\", \"type\": \"institution\"},\n {\"name\": \"Computer Services\", \"type\": \"institution\"},\n {\"name\": \"Acceptable Uses and Privacy Policy\", \"type\": \"policy\"},\n {\"name\": \"Academic Integrity Policy\", \"type\": \"policy\"},\n {\"name\": \"Student Conduct Code\", \"type\": \"policy\"},\n {\"name\": \"The Pledge\", \"type\": \"policy\"},\n {",
"metadata_b": {
"language": "en",
"char_length": 1824,
"primary_format": "prose",
"structural_signals": {
"has_headings": true,
"has_bullet_lists": true,
"has_numbered_lists": false,
"has_tables": false,
"has_code_blocks": false,
"has_dates": false
},
"content_signals": {
"has_named_people": true,
"has_institutional_language": true,
"has_technical_terminology": true,
"has_first_person": true,
"has_quotations": false
},
"domain_class": "educational",
"one_sentence_summary": "This document outlines a course syllabus for a design-focused class in Spring 2021."
}
},
{
"source": "2016 - DDF 205 - CAD I Syllabus.pdf",
"size_bucket": "large",
"doc_chars": 11012,
"metrics_a": {
"n_entities": 23,
"n_edges": 21,
"predicate_diversity": 15,
"type_diversity": 13,
"avg_degree": 1.83,
"largest_component": 18,
"largest_component_pct": 78.3
},
"metrics_b": {
"n_entities": 61,
"n_edges": 51,
"predicate_diversity": 27,
"type_diversity": 24,
"avg_degree": 1.67,
"largest_component": 40,
"largest_component_pct": 65.6
},
"extraction_a": "```json\n{\n \"entities\": [\n {\"name\": \"Computer Aided Design DDF 205-03\", \"type\": \"Course\"},\n {\"name\": \"Aaron Nelson\", \"type\": \"Instructor\"},\n {\"name\": \"Fall 2016\", \"type\": \"Semester\"},\n {\"name\": \"SAB 100 \u2013 HVMAC\", \"type\": \"Location\"},\n {\"name\": \"Rhino3D\", \"type\": \"Software\"},\n {\"name\": \"3D printing\", \"type\": \"Technology\"},\n {\"name\": \"NURBS geometry\", \"type\": \"Concept\"},\n {\"name\": \"CAD software\", \"type\": \"Software\"},\n {\"name\": \"digital fabrication\", \"type\": \"Skill\"},\n {\"name\": \"Desktop FDM printing\", \"type\": \"Technology\"},\n {\"name\": \"Rhino 5 Level 1 Training Guide\", \"type\": \"Textbook\"},\n {\"name\": \"Worksheets and Homework\", \"type\": \"Assessment\"},\n {\"name\": \"Project 1\", \"type\": \"Assessment\"},\n {\"name\": \"Midterm\", \"type\": \"Assessment\"},\n {\"name\": \"Project 2\", \"type\": \"Assessment\"},\n {\"name\": \"Project 3\", \"type\": \"Assessment\"},\n {\"name\": \"Attendance and Participation\", \"type\": \"Assessment\"},\n {\"name\": \"Disability Resource Center\", \"type\": \"Institution\"},\n {\"name\": \"Deanna Knapp\", \"type\": \"Person\"},\n {\"name\": \"Sojourner Truth Library\", \"type\": \"Institution\"},\n {\"name\": \"Office of Veteran and Military Services\", \"type\": \"Institution\"},\n {\"name\": \"SUNY New Paltz\", \"type\": \"Institution\"},\n {\"name\": \"Student Evaluation of Instruction\", \"type\": \"Process\"}\n ],\n \"edges\": [\n {\"subject\": \"Computer Aided Design DDF 205-03\", \"predicate\": \"is taught by\", \"object\": \"Aaron Nelson\"},\n {\"subject\": \"Computer Aided Design DDF 205-03\", \"predicate\": \"is offered in\", \"object\": \"Fall 2016\"},\n {\"subject\": \"Computer Aided Design DDF 205-03\", \"predicate\": \"meets at\", \"object\": \"SAB 100 \u2013 HVMAC\"},\n {\"subject\": \"Computer Aided Design DDF 205-03\", \"predicate\": \"uses software\", \"object\": \"Rhino3D\"},\n 
{\"subject\": \"Computer Aided Design DDF 205-03\", \"predicate\": \"teaches technology\", \"object\": \"3D printing\"},\n {\"subject\": \"Computer Aided Design DDF 205-03\", \"predicate\": \"covers concept\", \"object\": \"NURBS geometry\"},\n {\"subject\": \"Computer Aided Design DDF 205-03\", \"predicate\": \"uses textbook\", \"object\": \"Rhino 5 Level 1 Training Guide\"},\n {\"subject\": \"Computer Aided Design DDF 205-03\", \"predicate\": \"develops skill\", \"object\": \"digital fabrication\"},\n {\"subject\": \"Computer Aided Design DDF 205-03\", \"predicate\": \"teaches technology\", \"object\": \"Desktop FDM printing\"},\n {\"subject\": \"Computer Aided Design DDF 205-03\", \"predicate\": \"assesses through\", \"object\": \"Worksheets and Homework\"},\n {\"subject\": \"Computer Aided Design DDF 205-03\", \"predicate\": \"assesses through\", \"object\": \"Project 1\"},\n {\"subject\": \"Computer Aided Design DDF 205-03\", \"predicate\": \"assesses through\", \"object\": \"Midterm\"},\n {\"subject\": \"Computer Aided Design DDF 205-03\", \"predicate\": \"assesses through\", \"object\": \"Project 2\"},\n {\"subject\": \"Computer Aided Design DDF 205-03\", \"predicate\": \"assesses through\", \"object\": \"Project 3\"},\n {\"subject\": \"Computer Aided Design DDF 205-03\", \"predicate\": \"assesses through\", \"object\": \"Attendance and Participation\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"has office hours\", \"object\": \"M-R 11am \u2013 12pm\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"email\", \"object\": \"nelsona@newpaltz.edu\"},\n {\"subject\": \"Disability Resource Center\", \"predicate\": \"provides accommodations for\", \"object\": \"Computer Aided Design DDF 205-03\"},\n {\"subject\": \"Deanna Knapp\", \"predicate\": \"works at\", \"object\": \"Disability Resource Center\"},\n {\"subject\": \"Sojourner Truth Library\", \"predicate\": \"provides resources on\", \"object\": \"plagiarism\"},\n {\"subject\": \"SUNY New Paltz\", 
\"predicate\": \"operates\", \"object\": \"Office of Veteran and Military Services\"}\n ]\n}\n```",
"extraction_b": "```json\n{\n \"entities\": [\n {\"name\": \"Computer Aided Design\", \"type\": \"Course\"},\n {\"name\": \"DDF 205-03\", \"type\": \"Course Code\"},\n {\"name\": \"Fall 2016\", \"type\": \"Semester\"},\n {\"name\": \"Aaron Nelson\", \"type\": \"Person\"},\n {\"name\": \"SAB 100\", \"type\": \"Location\"},\n {\"name\": \"HVMAC\", \"type\": \"Location\"},\n {\"name\": \"nelsona@newpaltz.edu\", \"type\": \"Email\"},\n {\"name\": \"Rhino3D\", \"type\": \"Software\"},\n {\"name\": \"CAD software\", \"type\": \"Technology\"},\n {\"name\": \"NURBS geometry\", \"type\": \"Concept\"},\n {\"name\": \"3D printing\", \"type\": \"Technology\"},\n {\"name\": \"Rhino 5 Level 1 Training Guide\", \"type\": \"Textbook\"},\n {\"name\": \"digital fabrication software\", \"type\": \"Technology\"},\n {\"name\": \"digital fabrication hardware\", \"type\": \"Technology\"},\n {\"name\": \"spatial awareness\", \"type\": \"Learning Outcome\"},\n {\"name\": \"scales\", \"type\": \"Concept\"},\n {\"name\": \"measuring devices\", \"type\": \"Tool\"},\n {\"name\": \"three dimensional form\", \"type\": \"Concept\"},\n {\"name\": \"digital modeling\", \"type\": \"Concept\"},\n {\"name\": \"three dimensional design principals\", \"type\": \"Concept\"},\n {\"name\": \"Desktop FDM printing\", \"type\": \"Technology\"},\n {\"name\": \"tolerances\", \"type\": \"Concept\"},\n {\"name\": \"critical thinking\", \"type\": \"Skill\"},\n {\"name\": \"problem solving skills\", \"type\": \"Skill\"},\n {\"name\": \"Worksheets and Homework\", \"type\": \"Assessment\"},\n {\"name\": \"Project 1\", \"type\": \"Assessment\"},\n {\"name\": \"Midterm\", \"type\": \"Assessment\"},\n {\"name\": \"Project 2\", \"type\": \"Assessment\"},\n {\"name\": \"Project 3\", \"type\": \"Assessment\"},\n {\"name\": \"Attendance and Participation\", \"type\": \"Assessment\"},\n {\"name\": \"Grade A\", \"type\": \"Grade\"},\n {\"name\": \"Grade B\", \"type\": \"Grade\"},\n {\"name\": \"Grade C\", \"type\": 
\"Grade\"},\n {\"name\": \"Grade D\", \"type\": \"Grade\"},\n {\"name\": \"Grade F\", \"type\": \"Grade\"},\n {\"name\": \"November 4th\", \"type\": \"Date\"},\n {\"name\": \"Academic Integrity\", \"type\": \"Policy\"},\n {\"name\": \"Cheating\", \"type\": \"Violation\"},\n {\"name\": \"Forgery\", \"type\": \"Violation\"},\n {\"name\": \"Plagiarism\", \"type\": \"Violation\"},\n {\"name\": \"New Paltz\", \"type\": \"Institution\"},\n {\"name\": \"Disability Resource Center\", \"type\": \"Organization\"},\n {\"name\": \"Student Union\", \"type\": \"Location\"},\n {\"name\": \"Room 210\", \"type\": \"Location\"},\n {\"name\": \"Deanna Knapp\", \"type\": \"Person\"},\n {\"name\": \"knappd@newpaltz.edu\", \"type\": \"Email\"},\n {\"name\": \"Accommodation Memo\", \"type\": \"Document\"},\n {\"name\": \"Computer Services\", \"type\": \"Organization\"},\n {\"name\": \"Acceptable Uses and Privacy Policy\", \"type\": \"Policy\"},\n {\"name\": \"Student Evaluation of Instruction\", \"type\": \"Assessment\"},\n {\"name\": \"SEI\", \"type\": \"Assessment\"},\n {\"name\": \"November 30\u2013December 10\", \"type\": \"Date Range\"},\n {\"name\": \"April 21-May 5\", \"type\": \"Date Range\"},\n {\"name\": \"Office of Veteran and Military Services\", \"type\": \"Organization\"},\n {\"name\": \"OVMS\", \"type\": \"Organization\"},\n {\"name\": \"SUNY New Paltz\", \"type\": \"Institution\"},\n {\"name\": \"Sojourner Truth Library\", \"type\": \"Organization\"},\n {\"name\": \"M-R 9:30 \u2013 10:45 am\", \"type\": \"Schedule\"},\n {\"name\": \"M-R 11am \u2013 12pm\", \"type\": \"Schedule\"},\n {\"name\": \"3 Credit Hours\", \"type\": \"Credit\"},\n {\"name\": \"rapid manufacturing\", \"type\": \"Concept\"}\n ],\n \"edges\": [\n {\"subject\": \"Computer Aided Design\", \"predicate\": \"has course code\", \"object\": \"DDF 205-03\"},\n {\"subject\": \"Computer Aided Design\", \"predicate\": \"offered in\", \"object\": \"Fall 2016\"},\n {\"subject\": \"Computer Aided Design\", 
\"predicate\": \"taught by\", \"object\": \"Aaron Nelson\"},\n {\"subject\": \"Computer Aided Design\", \"predicate\": \"meets at location\", \"object\": \"SAB 100\"},\n {\"subject\": \"Computer Aided Design\", \"predicate\": \"meets at location\", \"object\": \"HVMAC\"},\n {\"subject\": \"Computer Aided Design\", \"predicate\": \"has schedule\", \"object\": \"M-R 9:30 \u2013 10:45 am\"},\n {\"subject\": \"Computer Aided Design\", \"predicate\": \"worth\", \"object\": \"3 Credit Hours\"},\n ",
"metadata_b": {
"language": "en",
"char_length": 962,
"primary_format": "prose",
"structural_signals": {
"has_headings": true,
"has_bullet_lists": false,
"has_numbered_lists": false,
"has_tables": false,
"has_code_blocks": false,
"has_dates": false
},
"content_signals": {
"has_named_people": true,
"has_institutional_language": true,
"has_technical_terminology": true,
"has_first_person": false,
"has_quotations": false
},
"domain_class": "educational",
"one_sentence_summary": "This document outlines a Computer Aided Design course for Fall 2016, including objectives, student learning outcomes, grading, attendance policy, and resources."
}
},
{
"source": "04_ALL_Materials and Their Properties_v3.pptx",
"size_bucket": "large",
"doc_chars": 12000,
"metrics_a": {
"n_entities": 57,
"n_edges": 81,
"predicate_diversity": 30,
"type_diversity": 15,
"avg_degree": 2.84,
"largest_component": 8,
"largest_component_pct": 14.0
},
"metrics_b": {
"n_entities": 130,
"n_edges": 169,
"predicate_diversity": 56,
"type_diversity": 34,
"avg_degree": 2.6,
"largest_component": 58,
"largest_component_pct": 44.6
},
"extraction_a": "```json\n{\n \"entities\": [\n {\"name\": \"PolyJet\", \"type\": \"3D Printing Technology\"},\n {\"name\": \"FDM\", \"type\": \"3D Printing Technology\"},\n {\"name\": \"PLA\", \"type\": \"FDM Material\"},\n {\"name\": \"ABS-PLUS\", \"type\": \"FDM Material\"},\n {\"name\": \"ABSM30\", \"type\": \"FDM Material\"},\n {\"name\": \"ABSI\", \"type\": \"FDM Material\"},\n {\"name\": \"ABS M30I\", \"type\": \"FDM Material\"},\n {\"name\": \"ABS-ESD7\", \"type\": \"FDM Material\"},\n {\"name\": \"ASA\", \"type\": \"FDM Material\"},\n {\"name\": \"NYLON 12\", \"type\": \"FDM Material\"},\n {\"name\": \"NYLON 6\", \"type\": \"FDM Material\"},\n {\"name\": \"PC-ABS\", \"type\": \"FDM Material\"},\n {\"name\": \"PC-ISO\", \"type\": \"FDM Material\"},\n {\"name\": \"ULTEM 9085\", \"type\": \"FDM Material\"},\n {\"name\": \"ULTEM 1010\", \"type\": \"FDM Material\"},\n {\"name\": \"Veros\", \"type\": \"PolyJet Material\"},\n {\"name\": \"Rigur\", \"type\": \"PolyJet Material\"},\n {\"name\": \"Durus\", \"type\": \"PolyJet Material\"},\n {\"name\": \"Tango\", \"type\": \"PolyJet Material\"},\n {\"name\": \"Agilus\", \"type\": \"PolyJet Material\"},\n {\"name\": \"Digital ABS\", \"type\": \"PolyJet Material\"},\n {\"name\": \"VEROCLEAR FULLCURE810\", \"type\": \"PolyJet Material\"},\n {\"name\": \"Thermal Resistance\", \"type\": \"ASTM Testing Standard\"},\n {\"name\": \"Tensile Strength\", \"type\": \"ASTM Testing Standard\"},\n {\"name\": \"Flexural Strength\", \"type\": \"ASTM Testing Standard\"},\n {\"name\": \"IZOD Impact\", \"type\": \"ASTM Testing Standard\"},\n {\"name\": \"Coefficient of Thermal Expansion\", \"type\": \"ASTM Testing Standard\"},\n {\"name\": \"Electrical Properties\", \"type\": \"ASTM Testing Standard\"},\n {\"name\": \"Water Absorption\", \"type\": \"ASTM Testing Standard\"},\n {\"name\": \"Shore Hardness\", \"type\": \"ASTM Testing Standard\"},\n {\"name\": \"Heat Deflection Temperature\", \"type\": \"Material Property\"},\n {\"name\": 
\"Tensile Modulus\", \"type\": \"Material Property\"},\n {\"name\": \"Elongation to Break\", \"type\": \"Material Property\"},\n {\"name\": \"Toughness\", \"type\": \"Material Property\"},\n {\"name\": \"Flexural Modulus\", \"type\": \"Material Property\"},\n {\"name\": \"Static Dissipative\", \"type\": \"Electrical Property\"},\n {\"name\": \"Surface Resistance\", \"type\": \"Electrical Property\"},\n {\"name\": \"Volume Resistance\", \"type\": \"Electrical Property\"},\n {\"name\": \"Shore A\", \"type\": \"Hardness Scale\"},\n {\"name\": \"Shore D\", \"type\": \"Hardness Scale\"},\n {\"name\": \"SR Support Structure\", \"type\": \"Support Structure Type\"},\n {\"name\": \"BASS Support Structure\", \"type\": \"Support Structure Type\"},\n {\"name\": \"Materials Data Sheet\", \"type\": \"Documentation\"},\n {\"name\": \"Safety Data Sheet\", \"type\": \"Documentation\"},\n {\"name\": \"Stratasys\", \"type\": \"Company\"},\n {\"name\": \"Thermoplastic\", \"type\": \"Material Category\"},\n {\"name\": \"Rigid Opaque\", \"type\": \"PolyJet Family\"},\n {\"name\": \"Rigid Transparent\", \"type\": \"PolyJet Family\"},\n {\"name\": \"Simulated Polypropylene Family\", \"type\": \"PolyJet Family\"},\n {\"name\": \"Flexible Rubber-like Family\", \"type\": \"PolyJet Family\"},\n {\"name\": \"Digital Material\", \"type\": \"PolyJet Family\"},\n {\"name\": \"Standard FDM Materials\", \"type\": \"FDM Material Category\"},\n {\"name\": \"Engineering FDM Materials\", \"type\": \"FDM Material Category\"},\n {\"name\": \"High Performance FDM Materials\", \"type\": \"FDM Material Category\"},\n {\"name\": \"Glass Transition Temperature\", \"type\": \"Material Property\"},\n {\"name\": \"REACH\", \"type\": \"Regulation\"},\n {\"name\": \"Global Harmonized System\", \"type\": \"Standard\"}\n ],\n \"edges\": [\n {\"subject\": \"PolyJet\", \"predicate\": \"offers advantages including\", \"object\": \"Clear transparency\"},\n {\"subject\": \"PolyJet\", \"predicate\": \"offers advantages 
including\", \"object\": \"Multi-material realism\"},\n {\"subject\": \"PolyJet\", \"predicate\": \"offers advantages including\", \"object\": \"Smooth surface finish\"},\n {\"subject\": \"FDM\", \"predicate\": \"offers advantages including\", \"object\": \"ABS functionality\"},\n {\"subject\": \"FDM\", \"predicate\": \"offers advantages including\", \"object\": \"Economic sparse fill\"},\n {\"subject\": \"FDM\", \"p",
"extraction_b": "```json\n{\n \"entities\": [\n {\"name\": \"Stratasys\", \"type\": \"Company\"},\n {\"name\": \"PolyJet\", \"type\": \"Technology\"},\n {\"name\": \"FDM\", \"type\": \"Technology\"},\n {\"name\": \"PLA\", \"type\": \"Material\"},\n {\"name\": \"ABS-PLUS\", \"type\": \"Material\"},\n {\"name\": \"ABSM30\", \"type\": \"Material\"},\n {\"name\": \"ABSI\", \"type\": \"Material\"},\n {\"name\": \"ABS M30I\", \"type\": \"Material\"},\n {\"name\": \"ABS-ESD7\", \"type\": \"Material\"},\n {\"name\": \"ASA\", \"type\": \"Material\"},\n {\"name\": \"NYLON 12\", \"type\": \"Material\"},\n {\"name\": \"NYLON 6\", \"type\": \"Material\"},\n {\"name\": \"PC-ABS\", \"type\": \"Material\"},\n {\"name\": \"PC-ISO\", \"type\": \"Material\"},\n {\"name\": \"ULTEM 9085\", \"type\": \"Material\"},\n {\"name\": \"ULTEM 1010\", \"type\": \"Material\"},\n {\"name\": \"Veros\", \"type\": \"Material\"},\n {\"name\": \"Rigur\", \"type\": \"Material\"},\n {\"name\": \"Durus\", \"type\": \"Material\"},\n {\"name\": \"Tango\", \"type\": \"Material\"},\n {\"name\": \"Agilus\", \"type\": \"Material\"},\n {\"name\": \"Digital ABS\", \"type\": \"Material\"},\n {\"name\": \"ABSplus\", \"type\": \"Material\"},\n {\"name\": \"VEROCLEAR FULLCURE810\", \"type\": \"Material\"},\n {\"name\": \"ANTERO\", \"type\": \"Material\"},\n {\"name\": \"PC\", \"type\": \"Material\"},\n {\"name\": \"NYLON CF\", \"type\": \"Material\"},\n {\"name\": \"Thermoplastics\", \"type\": \"Material Class\"},\n {\"name\": \"Rigid Opaque\", \"type\": \"Material Family\"},\n {\"name\": \"Rigid Transparent\", \"type\": \"Material Family\"},\n {\"name\": \"Simulated Polypropylene Family\", \"type\": \"Material Family\"},\n {\"name\": \"Flexible Rubber-like Family\", \"type\": \"Material Family\"},\n {\"name\": \"Digital Material\", \"type\": \"Material Family\"},\n {\"name\": \"Standard Materials\", \"type\": \"Material Category\"},\n {\"name\": \"Engineering Materials\", \"type\": \"Material Category\"},\n 
{\"name\": \"High Performance Materials\", \"type\": \"Material Category\"},\n {\"name\": \"SR support structures\", \"type\": \"Support Structure\"},\n {\"name\": \"BASS support structures\", \"type\": \"Support Structure\"},\n {\"name\": \"Thermal Resistance\", \"type\": \"Testing Standard\"},\n {\"name\": \"Heat Deflection Temperature\", \"type\": \"Testing Standard\"},\n {\"name\": \"HDT\", \"type\": \"Testing Standard\"},\n {\"name\": \"Tensile Strength\", \"type\": \"Testing Standard\"},\n {\"name\": \"Tensile Modulus\", \"type\": \"Testing Standard\"},\n {\"name\": \"Elongation to Break\", \"type\": \"Testing Standard\"},\n {\"name\": \"Toughness\", \"type\": \"Testing Standard\"},\n {\"name\": \"Flexural Strength\", \"type\": \"Testing Standard\"},\n {\"name\": \"Flexural Modulus\", \"type\": \"Testing Standard\"},\n {\"name\": \"IZOD Impact\", \"type\": \"Testing Standard\"},\n {\"name\": \"Coefficient of Thermal Expansion\", \"type\": \"Testing Standard\"},\n {\"name\": \"CTE\", \"type\": \"Testing Standard\"},\n {\"name\": \"Electrical Properties\", \"type\": \"Testing Standard\"},\n {\"name\": \"Static Dissipative\", \"type\": \"Electrical Property\"},\n {\"name\": \"Surface Resistance\", \"type\": \"Electrical Property\"},\n {\"name\": \"Volume Resistance\", \"type\": \"Electrical Property\"},\n {\"name\": \"Water Absorption\", \"type\": \"Testing Standard\"},\n {\"name\": \"Shore Hardness\", \"type\": \"Testing Standard\"},\n {\"name\": \"Shore A\", \"type\": \"Hardness Scale\"},\n {\"name\": \"Shore D\", \"type\": \"Hardness Scale\"},\n {\"name\": \"ASTM\", \"type\": \"Standards Organization\"},\n {\"name\": \"Materials Data Sheet\", \"type\": \"Document Type\"},\n {\"name\": \"Safety Data Sheet\", \"type\": \"Document Type\"},\n {\"name\": \"SDS\", \"type\": \"Document Type\"},\n {\"name\": \"REACH\", \"type\": \"Regulation\"},\n {\"name\": \"Global Harmonized System\", \"type\": \"Standard System\"},\n {\"name\": \"GHS\", \"type\": \"Standard 
System\"},\n {\"name\": \"Glass Transition Temperature\", \"type\": \"Property\"},\n {\"name\": \"Design Series 3D Printers\", \"type\": \"Equipment\"},\n {\"name\": \"Stress-Strain Graph\", \"type\": \"Testing Method\"},\n {\"name\": \"Two-point Bending\", \"type\": \"Testing Method\"},\n {\"name\": \"Pendulum Impact Test\", \"type\": \"Testing Method\"},\n {\"name\": \"Medical Applications\", \"type\": \"Application Domain\"},\n {\"name\": \"Dental Applic",
"metadata_b": {
"language": "en",
"char_length": 3204,
"primary_format": "prose",
"structural_signals": {
"has_headings": true,
"has_bullet_lists": true,
"has_numbered_lists": false,
"has_tables": false,
"has_code_blocks": false,
"has_dates": false
},
"content_signals": {
"has_named_people": false,
"has_institutional_language": true,
"has_technical_terminology": true,
"has_first_person": false,
"has_quotations": false
},
"domain_class": "technical",
"one_sentence_summary": "This document discusses Stratasys materials and their properties, testing standards, and applications in FDM and PolyJet technologies."
}
},
{
"source": "02_PPT_ALL_AM_Technologies_for_3DP_v3.pptx",
"size_bucket": "large",
"doc_chars": 9360,
"metrics_a": {
"n_entities": 67,
"n_edges": 128,
"predicate_diversity": 11,
"type_diversity": 9,
"avg_degree": 3.82,
"largest_component": 49,
"largest_component_pct": 73.1
},
"metrics_b": {
"n_entities": 105,
"n_edges": 158,
"predicate_diversity": 27,
"type_diversity": 11,
"avg_degree": 3.01,
"largest_component": 72,
"largest_component_pct": 68.6
},
"extraction_a": "```json\n{\n \"entities\": [\n {\"name\": \"Additive Manufacturing\", \"type\": \"Technology Domain\"},\n {\"name\": \"ASTM\", \"type\": \"Organization\"},\n {\"name\": \"Material Extrusion\", \"type\": \"AM Process\"},\n {\"name\": \"FDM\", \"type\": \"AM Technology\"},\n {\"name\": \"Fused Deposition Modeling\", \"type\": \"AM Technology\"},\n {\"name\": \"FFF\", \"type\": \"AM Technology\"},\n {\"name\": \"Fused Filament Fabrication\", \"type\": \"AM Technology\"},\n {\"name\": \"Vat Photopolymerization\", \"type\": \"AM Process\"},\n {\"name\": \"SL\", \"type\": \"AM Technology\"},\n {\"name\": \"SLA\", \"type\": \"AM Technology\"},\n {\"name\": \"Stereolithography\", \"type\": \"AM Technology\"},\n {\"name\": \"DLP\", \"type\": \"AM Technology\"},\n {\"name\": \"Digital Light Processing\", \"type\": \"AM Technology\"},\n {\"name\": \"3SP\", \"type\": \"AM Technology\"},\n {\"name\": \"Powder Bed Fusion\", \"type\": \"AM Process\"},\n {\"name\": \"SLS\", \"type\": \"AM Technology\"},\n {\"name\": \"Selective Laser Sintering\", \"type\": \"AM Technology\"},\n {\"name\": \"DMLS\", \"type\": \"AM Technology\"},\n {\"name\": \"Direct Metal Laser Sintering\", \"type\": \"AM Technology\"},\n {\"name\": \"EBM\", \"type\": \"AM Technology\"},\n {\"name\": \"Electron Beam Melting\", \"type\": \"AM Technology\"},\n {\"name\": \"SHS\", \"type\": \"AM Technology\"},\n {\"name\": \"Selective Heat Sintering\", \"type\": \"AM Technology\"},\n {\"name\": \"Binder Jetting\", \"type\": \"AM Process\"},\n {\"name\": \"CJP\", \"type\": \"AM Technology\"},\n {\"name\": \"ColorJet Printing\", \"type\": \"AM Technology\"},\n {\"name\": \"PP\", \"type\": \"AM Technology\"},\n {\"name\": \"Plaster-based 3D Printing\", \"type\": \"AM Technology\"},\n {\"name\": \"Sheet Lamination\", \"type\": \"AM Process\"},\n {\"name\": \"UC\", \"type\": \"AM Technology\"},\n {\"name\": \"Ultrasonic Consolidation\", \"type\": \"AM Technology\"},\n {\"name\": \"LOM\", \"type\": \"AM 
Technology\"},\n {\"name\": \"Laminated Object Manufacturing\", \"type\": \"AM Technology\"},\n {\"name\": \"Material Jetting\", \"type\": \"AM Process\"},\n {\"name\": \"MJP\", \"type\": \"AM Technology\"},\n {\"name\": \"PolyJet\", \"type\": \"AM Technology\"},\n {\"name\": \"Photopolymer Jetting\", \"type\": \"AM Technology\"},\n {\"name\": \"PJ\", \"type\": \"AM Technology\"},\n {\"name\": \"MultiJet Printing\", \"type\": \"AM Technology\"},\n {\"name\": \"Directed Energy Deposition\", \"type\": \"AM Process\"},\n {\"name\": \"LMD\", \"type\": \"AM Technology\"},\n {\"name\": \"Laser Metal Deposition\", \"type\": \"AM Technology\"},\n {\"name\": \"Nylon\", \"type\": \"Material\"},\n {\"name\": \"Photopolymers\", \"type\": \"Material\"},\n {\"name\": \"Thermoplastics\", \"type\": \"Material\"},\n {\"name\": \"Metals\", \"type\": \"Material\"},\n {\"name\": \"UV-active photopolymers\", \"type\": \"Material\"},\n {\"name\": \"Prototypes\", \"type\": \"Application\"},\n {\"name\": \"Manufacturing Aides\", \"type\": \"Application\"},\n {\"name\": \"Small series parts\", \"type\": \"Application\"},\n {\"name\": \"Casting patterns\", \"type\": \"Application\"},\n {\"name\": \"Green parts\", \"type\": \"Application\"},\n {\"name\": \"Molds and cores\", \"type\": \"Application\"},\n {\"name\": \"Tools for injection molding\", \"type\": \"Application\"},\n {\"name\": \"Tools for injection molds\", \"type\": \"Application\"},\n {\"name\": \"Support parts\", \"type\": \"Application\"},\n {\"name\": \"Lost wax casting\", \"type\": \"Application\"},\n {\"name\": \"Jewelry\", \"type\": \"Application\"},\n {\"name\": \"Dental\", \"type\": \"Application\"},\n {\"name\": \"Stratasys\", \"type\": \"Company\"},\n {\"name\": \"Eden 1\", \"type\": \"Product\"},\n {\"name\": \"Connex 2 Materials\", \"type\": \"Product\"},\n {\"name\": \"Connex 1/2/3\", \"type\": \"Product\"},\n {\"name\": \"J750\", \"type\": \"Product\"},\n {\"name\": \"Module 2\", \"type\": \"Course Module\"},\n 
{\"name\": \"Module 3\", \"type\": \"Course Module\"},\n {\"name\": \"Module 4\", \"type\": \"Course Module\"}\n ],\n \"edges\": [\n {\"subject\": \"ASTM\", \"predicate\": \"defines\", \"object\": \"Material Extrusion\"},\n {\"subject\": \"ASTM\", \"predicate\": \"defines\", \"object\": \"Vat Photopolymerization\"},\n {\"subject\": \"ASTM\", \"predicate\": \"defines\", \"object\": \"Powder Bed Fusion\"},\n {\"subject\": \"ASTM\", \"predicate\": \"defines\", \"obje",
|
||||
"extraction_b": "```json\n{\n \"entities\": [\n {\"name\": \"Additive Manufacturing\", \"type\": \"technology_domain\"},\n {\"name\": \"3D Printing\", \"type\": \"technology_domain\"},\n {\"name\": \"American Society for Testing Materials\", \"type\": \"organization\"},\n {\"name\": \"ASTM\", \"type\": \"organization\"},\n {\"name\": \"Material Extrusion\", \"type\": \"AM_process\"},\n {\"name\": \"FDM\", \"type\": \"AM_technology\"},\n {\"name\": \"Fused Deposition Modeling\", \"type\": \"AM_technology\"},\n {\"name\": \"FFF\", \"type\": \"AM_technology\"},\n {\"name\": \"Fused Filament Fabrication\", \"type\": \"AM_technology\"},\n {\"name\": \"Vat Photopolymerization\", \"type\": \"AM_process\"},\n {\"name\": \"SL\", \"type\": \"AM_technology\"},\n {\"name\": \"SLA\", \"type\": \"AM_technology\"},\n {\"name\": \"Stereolithography\", \"type\": \"AM_technology\"},\n {\"name\": \"DLP\", \"type\": \"AM_technology\"},\n {\"name\": \"Digital Light Processing\", \"type\": \"AM_technology\"},\n {\"name\": \"3SP\", \"type\": \"AM_technology\"},\n {\"name\": \"Scan, Spin, & Selectively Photocure\", \"type\": \"AM_technology\"},\n {\"name\": \"MultiJet Printing\", \"type\": \"AM_technology\"},\n {\"name\": \"Powder Bed Fusion\", \"type\": \"AM_process\"},\n {\"name\": \"SLS\", \"type\": \"AM_technology\"},\n {\"name\": \"Selective Laser Sintering\", \"type\": \"AM_technology\"},\n {\"name\": \"DMLS\", \"type\": \"AM_technology\"},\n {\"name\": \"Direct Metal Laser Sintering\", \"type\": \"AM_technology\"},\n {\"name\": \"EBM\", \"type\": \"AM_technology\"},\n {\"name\": \"Electron Beam Melting\", \"type\": \"AM_technology\"},\n {\"name\": \"SHS\", \"type\": \"AM_technology\"},\n {\"name\": \"Selective Heat Sintering\", \"type\": \"AM_technology\"},\n {\"name\": \"Binder Jetting\", \"type\": \"AM_process\"},\n {\"name\": \"CJP\", \"type\": \"AM_technology\"},\n {\"name\": \"ColorJet Printing\", \"type\": \"AM_technology\"},\n {\"name\": \"PP\", \"type\": 
\"AM_technology\"},\n {\"name\": \"Plaster-based 3D Printing\", \"type\": \"AM_technology\"},\n {\"name\": \"Sheet Lamination\", \"type\": \"AM_process\"},\n {\"name\": \"UC\", \"type\": \"AM_technology\"},\n {\"name\": \"Ultrasonic Consolidation\", \"type\": \"AM_technology\"},\n {\"name\": \"LOM\", \"type\": \"AM_technology\"},\n {\"name\": \"Laminated Object Manufacturing\", \"type\": \"AM_technology\"},\n {\"name\": \"Directed Energy Deposition\", \"type\": \"AM_process\"},\n {\"name\": \"LMD\", \"type\": \"AM_technology\"},\n {\"name\": \"Laser Metal Deposition\", \"type\": \"AM_technology\"},\n {\"name\": \"Material Jetting\", \"type\": \"AM_process\"},\n {\"name\": \"MJP\", \"type\": \"AM_technology\"},\n {\"name\": \"PJ\", \"type\": \"AM_technology\"},\n {\"name\": \"PolyJet\", \"type\": \"AM_technology\"},\n {\"name\": \"Photopolymer Jetting\", \"type\": \"AM_technology\"},\n {\"name\": \"LM\", \"type\": \"AM_technology\"},\n {\"name\": \"Laser Melting\", \"type\": \"AM_technology\"},\n {\"name\": \"SLM\", \"type\": \"AM_technology\"},\n {\"name\": \"Selective Laser Melting\", \"type\": \"AM_technology\"},\n {\"name\": \"BJ\", \"type\": \"AM_technology\"},\n {\"name\": \"MJ\", \"type\": \"AM_technology\"},\n {\"name\": \"photopolymers\", \"type\": \"material\"},\n {\"name\": \"plastics\", \"type\": \"material\"},\n {\"name\": \"standard plastics\", \"type\": \"material\"},\n {\"name\": \"Nylon\", \"type\": \"material\"},\n {\"name\": \"metals\", \"type\": \"material\"},\n {\"name\": \"standard metals\", \"type\": \"material\"},\n {\"name\": \"thermoplastics\", \"type\": \"material\"},\n {\"name\": \"powder\", \"type\": \"material\"},\n {\"name\": \"wax-like materials\", \"type\": \"material\"},\n {\"name\": \"UV-active photopolymers\", \"type\": \"material\"},\n {\"name\": \"thermoset\", \"type\": \"material\"},\n {\"name\": \"anisotropy\", \"type\": \"characteristic\"},\n {\"name\": \"z-direction\", \"type\": \"characteristic\"},\n {\"name\": \"vertical 
direction\", \"type\": \"characteristic\"},\n {\"name\": \"step structure\", \"type\": \"characteristic\"},\n {\"name\": \"mechanical properties\", \"type\": \"characteristic\"},\n {\"name\": \"durability\", \"type\": \"characteristic\"},\n {\"name\": \"accuracy\", \"type\": \"characteristic\"},\n {\"name\": \"surface finishes\", \"type\": \"characteristic\"},\n {\"name\": \"build volume\", \"type\": \"characteristic\"},\n {\"name\": \"tolerances\", \"type\": \"charact",
|
||||
"metadata_b": {
|
||||
"language": "en",
|
||||
"char_length": 3068,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": true,
|
||||
"has_bullet_lists": true,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": false
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": false,
|
||||
"has_institutional_language": true,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": false,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "technical",
|
||||
"one_sentence_summary": "Document discusses seven key Additive Manufacturing technologies and their characteristics"
|
||||
}
|
||||
},
|
||||
{
|
||||
"source": "CAD I Syllabus.docx",
|
||||
"size_bucket": "large",
|
||||
"doc_chars": 12000,
|
||||
"metrics_a": {
|
||||
"n_entities": 23,
|
||||
"n_edges": 24,
|
||||
"predicate_diversity": 24,
|
||||
"type_diversity": 18,
|
||||
"avg_degree": 2.09,
|
||||
"largest_component": 16,
|
||||
"largest_component_pct": 69.6
|
||||
},
|
||||
"metrics_b": {
|
||||
"n_entities": 45,
|
||||
"n_edges": 46,
|
||||
"predicate_diversity": 32,
|
||||
"type_diversity": 23,
|
||||
"avg_degree": 2.04,
|
||||
"largest_component": 27,
|
||||
"largest_component_pct": 60.0
|
||||
},
|
||||
"extraction_a": "```json\n{\n \"entities\": [\n {\"name\": \"Computer Aided Design I\", \"type\": \"Course\"},\n {\"name\": \"DDF 205-01\", \"type\": \"Course Code\"},\n {\"name\": \"Aaron Nelson\", \"type\": \"Instructor\"},\n {\"name\": \"Spring 2019\", \"type\": \"Semester\"},\n {\"name\": \"SAB 100\", \"type\": \"Location\"},\n {\"name\": \"HVMAC\", \"type\": \"Building\"},\n {\"name\": \"Rhino3D\", \"type\": \"Software\"},\n {\"name\": \"NURBS geometry\", \"type\": \"Concept\"},\n {\"name\": \"3D printing\", \"type\": \"Technology\"},\n {\"name\": \"Digital fabrication\", \"type\": \"Skill\"},\n {\"name\": \"Desktop FDM printing\", \"type\": \"Technology\"},\n {\"name\": \"Rhino 5 Level 1 Training Guide\", \"type\": \"Textbook\"},\n {\"name\": \"Blackboard\", \"type\": \"Platform\"},\n {\"name\": \"Academic Integrity\", \"type\": \"Policy\"},\n {\"name\": \"Disability Resource Center\", \"type\": \"Institution\"},\n {\"name\": \"Deanna Knapp\", \"type\": \"Person\"},\n {\"name\": \"Jean Vizvary\", \"type\": \"Person\"},\n {\"name\": \"SUNY New Paltz\", \"type\": \"Institution\"},\n {\"name\": \"Office of Veteran and Military Services\", \"type\": \"Institution\"},\n {\"name\": \"Student Evaluation of Instruction\", \"type\": \"Assessment\"},\n {\"name\": \"April 2nd\", \"type\": \"Date\"},\n {\"name\": \"April 24-May 8\", \"type\": \"Date Range\"},\n {\"name\": \"Sojourner Truth Library\", \"type\": \"Institution\"}\n ],\n \"edges\": [\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"has course code\", \"object\": \"DDF 205-01\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"is taught by\", \"object\": \"Aaron Nelson\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"is offered in\", \"object\": \"Spring 2019\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"meets in\", \"object\": \"SAB 100\"},\n {\"subject\": \"SAB 100\", \"predicate\": \"is located in\", \"object\": \"HVMAC\"},\n {\"subject\": 
\"Computer Aided Design I\", \"predicate\": \"teaches software\", \"object\": \"Rhino3D\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"covers concept\", \"object\": \"NURBS geometry\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"uses technology\", \"object\": \"3D printing\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"develops skill\", \"object\": \"Digital fabrication\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"covers technology\", \"object\": \"Desktop FDM printing\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"uses textbook\", \"object\": \"Rhino 5 Level 1 Training Guide\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"requires submission on\", \"object\": \"Blackboard\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"has office hours\", \"object\": \"M-R 10am \u2013 12pm\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"email is\", \"object\": \"nelsona@newpaltz.edu\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"enforces policy\", \"object\": \"Academic Integrity\"},\n {\"subject\": \"Students\", \"predicate\": \"can contact for accommodations\", \"object\": \"Disability Resource Center\"},\n {\"subject\": \"Deanna Knapp\", \"predicate\": \"is Assistant Director of\", \"object\": \"Disability Resource Center\"},\n {\"subject\": \"Jean Vizvary\", \"predicate\": \"is Director of\", \"object\": \"Disability Resource Center\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"is part of\", \"object\": \"SUNY New Paltz\"},\n {\"subject\": \"Students\", \"predicate\": \"can contact for veteran services\", \"object\": \"Office of Veteran and Military Services\"},\n {\"subject\": \"Students\", \"predicate\": \"must complete\", \"object\": \"Student Evaluation of Instruction\"},\n {\"subject\": \"Students\", \"predicate\": \"can drop course without penalty by\", \"object\": \"April 2nd\"},\n {\"subject\": \"Student 
Evaluation of Instruction\", \"predicate\": \"administered during\", \"object\": \"April 24-May 8\"},\n {\"subject\": \"Sojourner Truth Library\", \"predicate\": \"provides resources on\", \"object\": \"Academic Integrity\"}\n ]\n}\n```",
|
||||
"extraction_b": "```json\n{\n \"entities\": [\n {\"name\": \"Computer Aided Design I\", \"type\": \"Course\"},\n {\"name\": \"DDF 205-01\", \"type\": \"Course Code\"},\n {\"name\": \"Spring 2019\", \"type\": \"Semester\"},\n {\"name\": \"Aaron Nelson\", \"type\": \"Person\"},\n {\"name\": \"SAB 100\", \"type\": \"Location\"},\n {\"name\": \"HVMAC\", \"type\": \"Location\"},\n {\"name\": \"nelsona@newpaltz.edu\", \"type\": \"Email\"},\n {\"name\": \"Rhino3D\", \"type\": \"Software\"},\n {\"name\": \"NURBS geometry\", \"type\": \"Technical Concept\"},\n {\"name\": \"3D printing\", \"type\": \"Technology\"},\n {\"name\": \"Digital fabrication\", \"type\": \"Technical Concept\"},\n {\"name\": \"Desktop FDM printing\", \"type\": \"Technology\"},\n {\"name\": \"Rhino 5 Level 1 Training Guide\", \"type\": \"Text\"},\n {\"name\": \"Blackboard\", \"type\": \"Platform\"},\n {\"name\": \"April 2nd\", \"type\": \"Date\"},\n {\"name\": \"Attendance Policy\", \"type\": \"Policy\"},\n {\"name\": \"Academic Integrity Policy\", \"type\": \"Policy\"},\n {\"name\": \"Disability Resource Center\", \"type\": \"Institution\"},\n {\"name\": \"Student Union\", \"type\": \"Location\"},\n {\"name\": \"Room 210\", \"type\": \"Location\"},\n {\"name\": \"Deanna Knapp\", \"type\": \"Person\"},\n {\"name\": \"knappd@newpaltz.edu\", \"type\": \"Email\"},\n {\"name\": \"Jean Vizvary\", \"type\": \"Person\"},\n {\"name\": \"vizvaryj@newpaltz.edu\", \"type\": \"Email\"},\n {\"name\": \"New Paltz\", \"type\": \"Institution\"},\n {\"name\": \"SUNY New Paltz\", \"type\": \"Institution\"},\n {\"name\": \"Sojourner Truth Library\", \"type\": \"Institution\"},\n {\"name\": \"Office of Veteran and Military Services\", \"type\": \"Institution\"},\n {\"name\": \"Student Evaluation of Instruction\", \"type\": \"Assessment\"},\n {\"name\": \"April 24-May 8\", \"type\": \"Date Range\"},\n {\"name\": \"Grade A\", \"type\": \"Grade\"},\n {\"name\": \"Grade B\", \"type\": \"Grade\"},\n {\"name\": \"Grade C\", 
\"type\": \"Grade\"},\n {\"name\": \"Grade D\", \"type\": \"Grade\"},\n {\"name\": \"Grade F\", \"type\": \"Grade\"},\n {\"name\": \"3 absences\", \"type\": \"Policy Rule\"},\n {\"name\": \"10% deduction per day\", \"type\": \"Penalty Rule\"},\n {\"name\": \"CAD software\", \"type\": \"Software\"},\n {\"name\": \"Homework\", \"type\": \"Assignment Type\"},\n {\"name\": \"Projects\", \"type\": \"Assignment Type\"},\n {\"name\": \"Accommodation Memo\", \"type\": \"Document\"},\n {\"name\": \"Acceptable Uses and Privacy Policy\", \"type\": \"Policy\"},\n {\"name\": \"M-R 10am \u2013 12pm\", \"type\": \"Time Slot\"},\n {\"name\": \"T-F 9:30 \u2013 10:45 am\", \"type\": \"Time Slot\"},\n {\"name\": \"3 Credit Hours\", \"type\": \"Credit Value\"}\n ],\n \"edges\": [\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"has course code\", \"object\": \"DDF 205-01\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"offered in semester\", \"object\": \"Spring 2019\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"taught by\", \"object\": \"Aaron Nelson\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"meets at location\", \"object\": \"SAB 100\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"meets at location\", \"object\": \"HVMAC\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"meets during time\", \"object\": \"T-F 9:30 \u2013 10:45 am\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"has credit hours\", \"object\": \"3 Credit Hours\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"has email\", \"object\": \"nelsona@newpaltz.edu\"},\n {\"subject\": \"Aaron Nelson\", \"predicate\": \"has office hours\", \"object\": \"M-R 10am \u2013 12pm\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"introduces\", \"object\": \"3D computer aided design drawing\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"introduces\", \"object\": \"rapid 
manufacturing\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"uses software\", \"object\": \"Rhino3D\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"covers concept\", \"object\": \"NURBS geometry\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"covers technology\", \"object\": \"3D printing\"},\n {\"subject\": \"Computer Aided Design I\", \"predicate\": \"covers technology\", \"object\": \"Desktop FDM printing\"},\n",
|
||||
"metadata_b": {
|
||||
"language": "en",
|
||||
"char_length": 1024,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": true,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": true
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": true,
|
||||
"has_institutional_language": true,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": false,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "educational",
|
||||
"one_sentence_summary": "This document outlines a Computer Aided Design course for Spring 2019, including objectives, texts, grading policy, attendance policy, academic integrity policy, accommodation policy, and computer services policy."
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
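The `metrics_a` / `metrics_b` blocks above are consistent with simple undirected-graph statistics over each extraction: for the syllabus, 2 × 24 / 23 ≈ 2.09 and 16 / 23 ≈ 69.6%. The script that produced them is not shown in this diff, so the following is only a minimal sketch of how such numbers could be derived. The function name `graph_metrics` and the choice to ignore edges whose endpoints are not listed entities are assumptions, not the repository's actual code.

```python
from collections import defaultdict


def graph_metrics(entities, edges):
    """Hypothetical helper: summary statistics for one extraction.

    `entities` is a list of {"name", "type"} dicts and `edges` a list of
    {"subject", "predicate", "object"} dicts, as in the extraction JSON.
    """
    names = {e["name"] for e in entities}
    parent = {n: n for n in names}  # union-find forest over entity names

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for edge in edges:
        s, o = edge["subject"], edge["object"]
        # Assumption: only edges between listed entities join components.
        if s in names and o in names:
            parent[find(s)] = find(o)

    sizes = defaultdict(int)
    for n in names:
        sizes[find(n)] += 1

    n_entities = len(names)
    n_edges = len(edges)
    largest = max(sizes.values(), default=0)
    return {
        "n_entities": n_entities,
        "n_edges": n_edges,
        "predicate_diversity": len({e["predicate"] for e in edges}),
        "type_diversity": len({e["type"] for e in entities}),
        # Each edge contributes to the degree of two nodes.
        "avg_degree": round(2 * n_edges / n_entities, 2) if n_entities else 0.0,
        "largest_component": largest,
        "largest_component_pct": round(100 * largest / n_entities, 1) if n_entities else 0.0,
    }
```

Edges that point at names missing from the entity list (for example `Students` in `extraction_a`) are counted in `n_edges` but do not merge components in this sketch; a real implementation might treat them differently.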
File diff suppressed because one or more lines are too long
File diff suppressed because it is too large
File diff suppressed because it is too large
File diff suppressed because it is too large
@@ -0,0 +1,114 @@
|
||||
{
|
||||
"results": [
|
||||
{
|
||||
"name": "Claude: Modern website redesign with portfolio and contact form",
|
||||
"bucket": "high",
|
||||
"a_entities": 30,
|
||||
"b_entities": 27,
|
||||
"a_edges": 30,
|
||||
"b_edges": 27,
|
||||
"a_rel_types": 2,
|
||||
"b_rel_types": 2
|
||||
},
|
||||
{
|
||||
"name": "Claude: Preparing for dinner with Jim Agutter",
|
||||
"bucket": "high",
|
||||
"a_entities": 29,
|
||||
"b_entities": 28,
|
||||
"a_edges": 29,
|
||||
"b_edges": 28,
|
||||
"a_rel_types": 2,
|
||||
"b_rel_types": 2
|
||||
},
|
||||
{
|
||||
"name": "Claude: SUNY school closure risk and New Paltz",
|
||||
"bucket": "high",
|
||||
"a_entities": 28,
|
||||
"b_entities": 25,
|
||||
"a_edges": 28,
|
||||
"b_edges": 26,
|
||||
"a_rel_types": 2,
|
||||
"b_rel_types": 2
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Sanity CMS overview",
|
||||
"bucket": "mid",
|
||||
"a_entities": 10,
|
||||
"b_entities": 13,
|
||||
"a_edges": 11,
|
||||
"b_edges": 14,
|
||||
"a_rel_types": 2,
|
||||
"b_rel_types": 2
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Refactor Balance Logic.",
|
||||
"bucket": "mid",
|
||||
"a_entities": 9,
|
||||
"b_entities": 13,
|
||||
"a_edges": 9,
|
||||
"b_edges": 13,
|
||||
"a_rel_types": 2,
|
||||
"b_rel_types": 2
|
||||
},
|
||||
{
|
||||
"name": "Claude: Realtor requirements for home buying in New York",
|
||||
"bucket": "mid",
|
||||
"a_entities": 9,
|
||||
"b_entities": 6,
|
||||
"a_edges": 9,
|
||||
"b_edges": 6,
|
||||
"a_rel_types": 2,
|
||||
"b_rel_types": 2
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Python __name__ Main Explanation",
|
||||
"bucket": "low",
|
||||
"a_entities": 3,
|
||||
"b_entities": 3,
|
||||
"a_edges": 3,
|
||||
"b_edges": 3,
|
||||
"a_rel_types": 2,
|
||||
"b_rel_types": 2
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Push changes to repo",
|
||||
"bucket": "low",
|
||||
"a_entities": 3,
|
||||
"b_entities": 3,
|
||||
"a_edges": 3,
|
||||
"b_edges": 3,
|
||||
"a_rel_types": 2,
|
||||
"b_rel_types": 2
|
||||
},
|
||||
{
|
||||
"name": "Wearable Marquees uw4.pptx",
|
||||
"bucket": "document",
|
||||
"a_entities": 13,
|
||||
"b_entities": 15,
|
||||
"a_edges": 13,
|
||||
"b_edges": 15,
|
||||
"a_rel_types": 2,
|
||||
"b_rel_types": 2
|
||||
},
|
||||
{
|
||||
"name": "Nic Oconnor Field Work F2023 Syllabus.docx",
|
||||
"bucket": "document",
|
||||
"a_entities": 13,
|
||||
"b_entities": 17,
|
||||
"a_edges": 13,
|
||||
"b_edges": 17,
|
||||
"a_rel_types": 2,
|
||||
"b_rel_types": 2
|
||||
}
|
||||
],
|
||||
"aggregate": {
|
||||
"a_entities_total": 147,
|
||||
"b_entities_total": 150,
|
||||
"a_edges_total": 148,
|
||||
"b_edges_total": 152,
|
||||
"global_predicate_diversity": {
|
||||
"a": 2,
|
||||
"b": 2
|
||||
}
|
||||
}
|
||||
}
|
||||
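The `aggregate` totals in the file above equal the column sums of the ten `results` rows. As a quick cross-check, the sketch below recomputes them; `aggregate_totals` is a hypothetical name, and the row values are copied from the JSON in this diff.

```python
def aggregate_totals(results):
    """Hypothetical helper: sum per-source counts into *_total fields."""
    keys = ("a_entities", "b_entities", "a_edges", "b_edges")
    return {f"{k}_total": sum(r[k] for r in results) for k in keys}


# (a_entities, b_entities, a_edges, b_edges) per source, in file order.
rows = [
    (30, 27, 30, 27), (29, 28, 29, 28), (28, 25, 28, 26),
    (10, 13, 11, 14), (9, 13, 9, 13), (9, 6, 9, 6),
    (3, 3, 3, 3), (3, 3, 3, 3),
    (13, 15, 13, 15), (13, 17, 13, 17),
]
results = [
    dict(zip(("a_entities", "b_entities", "a_edges", "b_edges"), row))
    for row in rows
]
totals = aggregate_totals(results)
print(totals)
```

The printed totals can be compared against `a_entities_total`, `b_entities_total`, `a_edges_total`, and `b_edges_total` in the `aggregate` object.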
@@ -0,0 +1,90 @@
|
||||
{
|
||||
"per_source": [
|
||||
{
|
||||
"name": "Claude: Modern website redesign with portfolio and contact form",
|
||||
"bucket": "high",
|
||||
"a_edges": 28,
|
||||
"a_preds": 16,
|
||||
"b_edges": 25,
|
||||
"b_preds": 15
|
||||
},
|
||||
{
|
||||
"name": "Claude: Preparing for dinner with Jim Agutter",
|
||||
"bucket": "high",
|
||||
"a_edges": 16,
|
||||
"a_preds": 11,
|
||||
"b_edges": 21,
|
||||
"b_preds": 15
|
||||
},
|
||||
{
|
||||
"name": "Claude: SUNY school closure risk and New Paltz",
|
||||
"bucket": "high",
|
||||
"a_edges": 23,
|
||||
"a_preds": 15,
|
||||
"b_edges": 16,
|
||||
"b_preds": 12
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Sanity CMS overview",
|
||||
"bucket": "mid",
|
||||
"a_edges": 9,
|
||||
"a_preds": 8,
|
||||
"b_edges": 8,
|
||||
"b_preds": 7
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Refactor Balance Logic.",
|
||||
"bucket": "mid",
|
||||
"a_edges": 1,
|
||||
"a_preds": 1,
|
||||
"b_edges": 9,
|
||||
"b_preds": 8
|
||||
},
|
||||
{
|
||||
"name": "Claude: Realtor requirements for home buying in New York",
|
||||
"bucket": "mid",
|
||||
"a_edges": 7,
|
||||
"a_preds": 4,
|
||||
"b_edges": 7,
|
||||
"b_preds": 4
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Python __name__ Main Explanation",
|
||||
"bucket": "low",
|
||||
"a_edges": 2,
|
||||
"a_preds": 2,
|
||||
"b_edges": 5,
|
||||
"b_preds": 2
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Push changes to repo",
|
||||
"bucket": "low",
|
||||
"a_edges": 2,
|
||||
"a_preds": 2,
|
||||
"b_edges": 2,
|
||||
"b_preds": 2
|
||||
},
|
||||
{
|
||||
"name": "Wearable Marquees uw4.pptx",
|
||||
"bucket": "document",
|
||||
"a_edges": 12,
|
||||
"a_preds": 6,
|
||||
"b_edges": 10,
|
||||
"b_preds": 10
|
||||
},
|
||||
{
|
||||
"name": "Nic Oconnor Field Work F2023 Syllabus.docx",
|
||||
"bucket": "document",
|
||||
"a_edges": 9,
|
||||
"a_preds": 5,
|
||||
"b_edges": 13,
|
||||
"b_preds": 10
|
||||
}
|
||||
],
|
||||
"aggregate": {
|
||||
"a_preds": 70,
|
||||
"b_preds": 85,
|
||||
"a_edges": 109,
|
||||
"b_edges": 116
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,344 @@
|
||||
{
|
||||
"results": [
|
||||
{
|
||||
"name": "Claude: Modern website redesign with portfolio and contact form",
|
||||
"bucket": "high",
|
||||
"tier1_entities": 30,
|
||||
"doc_chars": 25254,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 12000,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": true,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": false
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": true,
|
||||
"has_institutional_language": true,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": true,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "personal",
|
||||
"one_sentence_summary": "Discussion on redesigning a personal website for an artist and designer focusing on portfolio and contact form"
|
||||
},
|
||||
"metadata_elapsed_s": 128.0,
|
||||
"custom_extraction_instructions": "This is a personal document in prose format. Summary: Discussion on redesigning a personal website for an artist and designer focusing on portfolio and contact form This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 24.9,
|
||||
"submit_result": {
|
||||
"ok": true
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "Claude: Preparing for dinner with Jim Agutter",
|
||||
"bucket": "high",
|
||||
"tier1_entities": 29,
|
||||
"doc_chars": 10053,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 10053,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": true,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": false
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": true,
|
||||
"has_institutional_language": true,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": false,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "educational",
|
||||
"one_sentence_summary": "Preparation for a dinner meeting with Jim Agutter to discuss his background, roles, research focus, projects, awards, and strategies for discussing contribution to the broader mission and interdisciplinary partnerships at his institution."
|
||||
},
|
||||
"metadata_elapsed_s": 96.7,
|
||||
"custom_extraction_instructions": "This is a educational document in prose format. Summary: Preparation for a dinner meeting with Jim Agutter to discuss his background, roles, research focus, projects, awards, and strategies for discussing contribution to the broader mission and interdisciplinary partnerships at his institution. This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 37.0,
|
||||
"submit_result": {
|
||||
"ok": true
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "Claude: SUNY school closure risk and New Paltz",
|
||||
"bucket": "high",
|
||||
"tier1_entities": 28,
|
||||
"doc_chars": 25937,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 12000,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": true,
|
||||
"has_bullet_lists": true,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": true
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": true,
|
||||
"has_institutional_language": true,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": true,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "educational",
|
||||
"one_sentence_summary": "Discussion about the risk of closure for a specific program at SUNY New Paltz and the potential impact on the program director's career"
|
||||
},
|
||||
"metadata_elapsed_s": 119.0,
|
||||
"custom_extraction_instructions": "This is a educational document in prose format. Summary: Discussion about the risk of closure for a specific program at SUNY New Paltz and the potential impact on the program director's career This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 27.4,
|
||||
"submit_result": {
|
||||
"ok": true
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Sanity CMS overview",
|
||||
"bucket": "mid",
|
||||
"tier1_entities": 10,
|
||||
"doc_chars": 14183,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 12000,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": true,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": false
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": false,
|
||||
"has_institutional_language": true,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": false,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "educational",
|
||||
"one_sentence_summary": "Discussion about Sanity CMS, its functionality, and how it is used with Next.js for content management."
|
||||
},
|
||||
"metadata_elapsed_s": 146.0,
|
||||
"custom_extraction_instructions": "This is a educational document in prose format. Summary: Discussion about Sanity CMS, its functionality, and how it is used with Next.js for content management. This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 20.8,
|
||||
"submit_result": {
|
||||
"ok": true
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Refactor Balance Logic.",
|
||||
"bucket": "mid",
|
||||
"tier1_entities": 9,
|
||||
"doc_chars": 145479,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 12000,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": false,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": true
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": false,
|
||||
"has_institutional_language": false,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": true,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "technical",
|
||||
"one_sentence_summary": "The document discusses a MedianMaintainingHeap class in Python for maintaining a running median in a stream of data using two heaps, a max heap and a min heap."
|
||||
},
|
||||
"metadata_elapsed_s": 143.1,
|
||||
"custom_extraction_instructions": "This is a technical document in prose format. Summary: The document discusses a MedianMaintainingHeap class in Python for maintaining a running median in a stream of data using two heaps, a max heap and a min heap. This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 18.1,
|
||||
"submit_result": {
|
||||
"ok": true
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "Claude: Realtor requirements for home buying in New York",
|
||||
"bucket": "mid",
|
||||
"tier1_entities": 9,
|
||||
"doc_chars": 15729,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 12000,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": true,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": true
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": false,
|
||||
"has_institutional_language": false,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": true,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "educational",
|
||||
"one_sentence_summary": "Document explains the process of buying a home in New York without a realtor, outlining steps such as financing, scheduling showings, conducting comparative market analysis, making an offer, hiring a real estate attorney, getting a home inspection, finalizing mortgage, and closing."
|
||||
},
|
||||
"metadata_elapsed_s": 118.9,
|
||||
"custom_extraction_instructions": "This is a educational document in prose format. Summary: Document explains the process of buying a home in New York without a realtor, outlining steps such as financing, scheduling showings, conducting comparative market analysis, making an offer, hiring a real estate attorney, getting a home inspection, finalizing mortgage, and closing. This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 18.6,
|
||||
"submit_result": {
|
||||
"ok": true
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Python __name__ Main Explanation",
|
||||
"bucket": "low",
|
||||
"tier1_entities": 3,
|
||||
"doc_chars": 4328,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 4328,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": true,
|
||||
"has_bullet_lists": true,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": true,
|
||||
"has_dates": false
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": false,
|
||||
"has_institutional_language": false,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": false,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "educational",
|
||||
"one_sentence_summary": "Discussion on Python script execution, dot product calculation, and usage of enumerate function in Python"
|
||||
},
|
||||
"metadata_elapsed_s": 53.8,
|
||||
"custom_extraction_instructions": "This is a educational document in prose format. Summary: Discussion on Python script execution, dot product calculation, and usage of enumerate function in Python This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 7.9,
|
||||
"submit_result": {
|
||||
"ok": true
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Push changes to repo",
|
||||
"bucket": "low",
|
||||
"tier1_entities": 3,
|
||||
"doc_chars": 1323,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 1323,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": false,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": true,
|
||||
"has_dates": false
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": false,
|
||||
"has_institutional_language": false,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": false,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "educational",
|
||||
"one_sentence_summary": "A conversation on how to push changes made in a local development directory to a Git repository"
|
||||
},
|
||||
"metadata_elapsed_s": 24.5,
|
||||
"custom_extraction_instructions": "This is a educational document in prose format. Summary: A conversation on how to push changes made in a local development directory to a Git repository This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 7.2,
|
||||
"submit_result": {
|
||||
"ok": true
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "Wearable Marquees uw4.pptx",
|
||||
"bucket": "document",
|
||||
"tier1_entities": 13,
|
||||
"doc_chars": 11408,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 11408,
|
||||
"primary_format": "code",
|
||||
"structural_signals": {
|
||||
"has_headings": false,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": true,
|
||||
"has_dates": false
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": false,
|
||||
"has_institutional_language": false,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": false,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "educational",
|
||||
"one_sentence_summary": "This document is a code for a wearable marquee with the text 'Wearable Marquees UW \u2013 Stout September 2025' displayed using an Adafruit DotStar Matrix."
|
||||
},
|
||||
"metadata_elapsed_s": 152.3,
|
||||
"custom_extraction_instructions": "This is a educational document in code format. Summary: This document is a code for a wearable marquee with the text 'Wearable Marquees UW \u2013 Stout September 2025' displayed using an Adafruit DotStar Matrix. This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 20.0,
|
||||
"submit_result": {
|
||||
"ok": true
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "Nic Oconnor Field Work F2023 Syllabus.docx",
|
||||
"bucket": "document",
|
||||
"tier1_entities": 13,
|
||||
"doc_chars": 17142,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 12000,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": true,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": true
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": true,
|
||||
"has_institutional_language": true,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": true,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "educational",
|
||||
"one_sentence_summary": "This document details an independent study focusing on a direct work experience at Dogwood Entertainment for DDF 795, a 3D modeling course."
|
||||
},
|
||||
"metadata_elapsed_s": 118.4,
|
||||
"custom_extraction_instructions": "This is a educational document in prose format. Summary: This document details an independent study focusing on a direct work experience at Dogwood Entertainment for DDF 795, a 3D modeling course. This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 26.3,
|
||||
"submit_result": {
|
||||
"ok": true
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,354 @@
|
||||
{
|
||||
"results": [
|
||||
{
|
||||
"name": "Claude: Modern website redesign with portfolio and contact form",
|
||||
"bucket": "high",
|
||||
"tier1_entities": 30,
|
||||
"doc_chars": 25254,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 12000,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": true,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": true
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": false,
|
||||
"has_institutional_language": true,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": true,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "personal",
|
||||
"one_sentence_summary": "Discussion on redesigning a personal website for a designer, focusing on portfolio and contact form"
|
||||
},
|
||||
"metadata_elapsed_s": 128.8,
|
||||
"source_description": "This is a personal document in prose format. Summary: Discussion on redesigning a personal website for a designer, focusing on portfolio and contact form This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 34.2,
|
||||
"submit_result": {
|
||||
"ok": true,
|
||||
"count": 1
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "Claude: Preparing for dinner with Jim Agutter",
|
||||
"bucket": "high",
|
||||
"tier1_entities": 29,
|
||||
"doc_chars": 10053,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 10053,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": true,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": true
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": true,
|
||||
"has_institutional_language": true,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": false,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "educational",
|
||||
"one_sentence_summary": "A conversation between the interviewee and Claude about preparing for a dinner with Jim Agutter, discussing his background, current roles, research focus, recent projects, awards, and strategies for their conversation"
|
||||
},
|
||||
"metadata_elapsed_s": 95.3,
|
||||
"source_description": "This is a educational document in prose format. Summary: A conversation between the interviewee and Claude about preparing for a dinner with Jim Agutter, discussing his background, current roles, research focus, recent projects, awards, and strategies for their conversation This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 48.1,
|
||||
"submit_result": {
|
||||
"ok": true,
|
||||
"count": 1
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "Claude: SUNY school closure risk and New Paltz",
|
||||
"bucket": "high",
|
||||
"tier1_entities": 28,
|
||||
"doc_chars": 25937,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 12000,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": true,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": true
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": true,
|
||||
"has_institutional_language": true,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": true,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "educational",
|
||||
"one_sentence_summary": "Discussion about the risk of program closure at SUNY New Paltz and the pros and cons of staying or leaving the institution"
|
||||
},
|
||||
"metadata_elapsed_s": 118.1,
|
||||
"source_description": "This is a educational document in prose format. Summary: Discussion about the risk of program closure at SUNY New Paltz and the pros and cons of staying or leaving the institution This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 44.4,
|
||||
"submit_result": {
|
||||
"ok": true,
|
||||
"count": 1
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Sanity CMS overview",
|
||||
"bucket": "mid",
|
||||
"tier1_entities": 10,
|
||||
"doc_chars": 14183,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 12000,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": true,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": false
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": false,
|
||||
"has_institutional_language": true,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": false,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "educational",
|
||||
"one_sentence_summary": "This document provides an overview of Sanity CMS, its functionality and features."
|
||||
},
|
||||
"metadata_elapsed_s": 143.4,
|
||||
"source_description": "This is a educational document in prose format. Summary: This document provides an overview of Sanity CMS, its functionality and features. This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 31.3,
|
||||
"submit_result": {
|
||||
"ok": true,
|
||||
"count": 1
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Refactor Balance Logic.",
|
||||
"bucket": "mid",
|
||||
"tier1_entities": 9,
|
||||
"doc_chars": 145479,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 12000,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": true,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": false
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": false,
|
||||
"has_institutional_language": false,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": true,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "technical",
|
||||
"one_sentence_summary": "A document discussing a MedianMaintainingHeap class for maintaining the median of a stream of data using two heaps"
|
||||
},
|
||||
"metadata_elapsed_s": 141.6,
|
||||
"source_description": "This is a technical document in prose format. Summary: A document discussing a MedianMaintainingHeap class for maintaining the median of a stream of data using two heaps This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 22.8,
|
||||
"submit_result": {
|
||||
"ok": true,
|
||||
"count": 1
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "Claude: Realtor requirements for home buying in New York",
|
||||
"bucket": "mid",
|
||||
"tier1_entities": 9,
|
||||
"doc_chars": 15729,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 12000,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": true,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": true
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": true,
|
||||
"has_institutional_language": false,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": true,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "educational",
|
||||
"one_sentence_summary": "A conversation discussing the process of buying a home in New York without a realtor and the role a realtor plays"
|
||||
},
|
||||
"metadata_elapsed_s": 115.4,
|
||||
"source_description": "This is a educational document in prose format. Summary: A conversation discussing the process of buying a home in New York without a realtor and the role a realtor plays This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 15.1,
|
||||
"submit_result": {
|
||||
"ok": true,
|
||||
"count": 1
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Python __name__ Main Explanation",
|
||||
"bucket": "low",
|
||||
"tier1_entities": 3,
|
||||
"doc_chars": 4328,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 4328,
|
||||
"primary_format": "mixed",
|
||||
"structural_signals": {
|
||||
"has_headings": true,
|
||||
"has_bullet_lists": true,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": true,
|
||||
"has_dates": false
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": false,
|
||||
"has_institutional_language": false,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": false,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "educational",
|
||||
"one_sentence_summary": "The document provides explanations about various topics in Python programming, including the __name__ check, vector dot product, and enumerate function."
|
||||
},
|
||||
"metadata_elapsed_s": 54.7,
|
||||
"source_description": "This is a educational document in mixed format. Summary: The document provides explanations about various topics in Python programming, including the __name__ check, vector dot product, and enumerate function. This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 16.8,
|
||||
"submit_result": {
|
||||
"ok": true,
|
||||
"count": 1
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Push changes to repo",
|
||||
"bucket": "low",
|
||||
"tier1_entities": 3,
|
||||
"doc_chars": 1323,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 1323,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": false,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": true,
|
||||
"has_dates": false
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": false,
|
||||
"has_institutional_language": false,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": false,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "educational",
|
||||
"one_sentence_summary": "Instructions on how to push changes made locally in a Git repository."
|
||||
},
|
||||
"metadata_elapsed_s": 23.7,
|
||||
"source_description": "This is a educational document in prose format. Summary: Instructions on how to push changes made locally in a Git repository. This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 10.8,
|
||||
"submit_result": {
|
||||
"ok": true,
|
||||
"count": 1
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "Wearable Marquees uw4.pptx",
|
||||
"bucket": "document",
|
||||
"tier1_entities": 13,
|
||||
"doc_chars": 11408,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 11408,
|
||||
"primary_format": "code",
|
||||
"structural_signals": {
|
||||
"has_headings": false,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": true,
|
||||
"has_dates": false
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": false,
|
||||
"has_institutional_language": false,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": false,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "technical",
|
||||
"one_sentence_summary": "This document contains code for a wearable marquee display using Adafruit DotStar Matrix."
|
||||
},
|
||||
"metadata_elapsed_s": 150.0,
|
||||
"source_description": "This is a technical document in code format. Summary: This document contains code for a wearable marquee display using Adafruit DotStar Matrix. This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 31.9,
|
||||
"submit_result": {
|
||||
"ok": true,
|
||||
"count": 1
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "Nic Oconnor Field Work F2023 Syllabus.docx",
|
||||
"bucket": "document",
|
||||
"tier1_entities": 13,
|
||||
"doc_chars": 17142,
|
||||
"metadata": {
|
||||
"language": "en",
|
||||
"char_length": 12000,
|
||||
"primary_format": "prose",
|
||||
"structural_signals": {
|
||||
"has_headings": true,
|
||||
"has_bullet_lists": false,
|
||||
"has_numbered_lists": false,
|
||||
"has_tables": false,
|
||||
"has_code_blocks": false,
|
||||
"has_dates": true
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": true,
|
||||
"has_institutional_language": true,
|
||||
"has_technical_terminology": true,
|
||||
"has_first_person": true,
|
||||
"has_quotations": false
|
||||
},
|
||||
"domain_class": "educational",
|
||||
"one_sentence_summary": "This document is a course syllabus for an independent study in 3D modeling at New Paltz University in Fall 2023"
|
||||
},
|
||||
"metadata_elapsed_s": 118.0,
|
||||
"source_description": "This is a educational document in prose format. Summary: This document is a course syllabus for an independent study in 3D modeling at New Paltz University in Fall 2023 This metadata is provided to orient your extraction, not to constrain it. Extract entities and relationships freely from the document text itself; the metadata is descriptive context, not a checklist.",
|
||||
"submit_elapsed_s": 29.8,
|
||||
"submit_result": {
|
||||
"ok": true,
|
||||
"count": 1
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,61 @@
|
||||
{
|
||||
"metadata": {
|
||||
"purpose": "E1 cascade re-extraction sample (n=10)",
|
||||
"stratification": "density buckets + document subset",
|
||||
"quartile_top": 19,
|
||||
"quartile_bottom": 5,
|
||||
"total_tier1_episodes": 250
|
||||
},
|
||||
"selected": [
|
||||
{
|
||||
"name": "Claude: Modern website redesign with portfolio and contact form",
|
||||
"entities": 30,
|
||||
"bucket": "high"
|
||||
},
|
||||
{
|
||||
"name": "Claude: Preparing for dinner with Jim Agutter",
|
||||
"entities": 29,
|
||||
"bucket": "high"
|
||||
},
|
||||
{
|
||||
"name": "Claude: SUNY school closure risk and New Paltz",
|
||||
"entities": 28,
|
||||
"bucket": "high"
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Sanity CMS overview",
|
||||
"entities": 10,
|
||||
"bucket": "mid"
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Refactor Balance Logic.",
|
||||
"entities": 9,
|
||||
"bucket": "mid"
|
||||
},
|
||||
{
|
||||
"name": "Claude: Realtor requirements for home buying in New York",
|
||||
"entities": 9,
|
||||
"bucket": "mid"
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Python __name__ Main Explanation",
|
||||
"entities": 3,
|
||||
"bucket": "low"
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Push changes to repo",
|
||||
"entities": 3,
|
||||
"bucket": "low"
|
||||
},
|
||||
{
|
||||
"name": "Wearable Marquees uw4.pptx",
|
||||
"entities": 13,
|
||||
"bucket": "document"
|
||||
},
|
||||
{
|
||||
"name": "Nic Oconnor Field Work F2023 Syllabus.docx",
|
||||
"entities": 13,
|
||||
"bucket": "document"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,736 @@
|
||||
{
|
||||
"experiment": "cascade_test",
|
||||
"title": "Nodes-vs-Edges Cascade Experiment",
|
||||
"started_at": "2026-04-28T05:13:36.125237+00:00",
|
||||
"completed_at": "2026-04-28T05:24:53.344615+00:00",
|
||||
"haiku_model": "claude-haiku-4-5-20251001",
|
||||
"haiku_temperature": 0.0,
|
||||
"haiku_max_tokens": 4096,
|
||||
"local_model": "mistral",
|
||||
"max_doc_chars": 8000,
|
||||
"n_documents": 15,
|
||||
"n_valid_pairs": 14,
|
||||
"n_skipped": 1,
|
||||
"total_elapsed_s": 677.2,
|
||||
"totals": {
|
||||
"a_input_tokens": 11015,
|
||||
"a_output_tokens": 18803,
|
||||
"b_input_tokens": 12685,
|
||||
"b_output_tokens": 15809,
|
||||
"a_cost_usd": 0.105,
|
||||
"b_cost_usd": 0.0917,
|
||||
"cost_delta_usd": -0.0133,
|
||||
"cost_delta_pct": -12.66,
|
||||
"note": "API cost only \u2014 local Mistral runtime on VPS not monetized"
|
||||
},
|
||||
"by_size_bucket": {
|
||||
"small": {
|
||||
"n": 5,
|
||||
"a_input_tokens": 1236,
|
||||
"a_output_tokens": 2788,
|
||||
"b_input_tokens": 1599,
|
||||
"b_output_tokens": 2815,
|
||||
"input_delta_pct": 29.37,
|
||||
"output_delta_pct": 0.97,
|
||||
"a_avg_edges": 14,
|
||||
"b_avg_edges": 13.6
|
||||
},
|
||||
"medium": {
|
||||
"n": 6,
|
||||
"a_input_tokens": 3595,
|
||||
"a_output_tokens": 8477,
|
||||
"b_input_tokens": 4295,
|
||||
"b_output_tokens": 6153,
|
||||
"input_delta_pct": 19.47,
|
||||
"output_delta_pct": -27.42,
|
||||
"a_avg_edges": 32.5,
|
||||
"b_avg_edges": 24
|
||||
},
|
||||
"large": {
|
||||
"n": 3,
|
||||
"a_input_tokens": 6184,
|
||||
"a_output_tokens": 7538,
|
||||
"b_input_tokens": 6791,
|
||||
"b_output_tokens": 6841,
|
||||
"input_delta_pct": 9.82,
|
||||
"output_delta_pct": -9.25,
|
||||
"a_avg_edges": 40,
|
||||
"b_avg_edges": 32
|
||||
}
|
||||
},
|
||||
"results": [
|
||||
{
|
||||
"source": "02_2D Geometry.docx",
|
||||
"size_bucket": "small",
|
||||
"doc_chars_original": 188,
|
||||
"doc_chars_sent": 188,
|
||||
"truncated": false,
|
||||
"condition_a": {
|
||||
"input_tokens": 207,
|
||||
"output_tokens": 718,
|
||||
"latency_s": 3.21,
|
||||
"entity_count": 10,
|
||||
"edge_count": 21,
|
||||
"stop_reason": "end_turn",
|
||||
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"2D geometry\", \"type\": \"geometric concept\"},\n {\"name\": \"Curves\", \"type\": \"geometric element\"},\n {\"name\": \"lines\", \"type\": \"geometric element\"},\n {\"name\": \"Poly lines\", \"type\": \"geometric element\"},\n {\"name\": \"planar\", \"type\": \"geometric property\"},\n {\"name\": \"non-planar\", \"type\": \"geometric property\"},\n {\"name\": \"open\", \"type\": \"geometric property\"},\n {\"name\": \"closed\", \"type\": \"geometric property\"},\n {\"name\": \"complex forms\", \"type\": \"geometric concept\"},\n {\"name\": \"geometries\", \"type\": \"geometric concept\"}\n ],\n \"edges\": [\n {\"subject\": \"2D geometry\", \"predicate\": \"is comprised of\", \"object\": \"Curves\"},\n {\"subject\": \"2D geometry\", \"predicate\": \"is comprised of\", \"object\": \"lines\"},\n {\"subject\": \"2D geometry\", \"predicate\": \"is comprised of\", \"object\": \"Poly lines\"},\n {\"subject\": \"Curves\", \"predicate\": \"can be\", \"object\": \"planar\"},\n {\"subject\": \"Curves\", \"predicate\": \"can be\", \"object\": \"non-planar\"},\n {\"subject\": \"Curves\", \"predicate\": \"can be\", \"object\": \"open\"},\n {\"subject\": \"Curves\", \"predicate\": \"can be\", \"object\": \"closed\"},\n {\"subject\": \"lines\", \"predicate\": \"can be\", \"object\": \"planar\"},\n {\"subject\": \"lines\", \"predicate\": \"can be\", \"object\": \"non-planar\"},\n {\"subject\": \"lines\", \"predicate\": \"can be\", \"object\": \"open\"},\n {\"subject\": \"lines\", \"predicate\": \"can be\", \"object\": \"closed\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"can be\", \"object\": \"planar\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"can be\", \"object\": \"non-planar\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"can be\", \"object\": \"open\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"can be\", \"object\": \"closed\"},\n {\"subject\": \"Curves\", \"predicate\": \"are 
building blocks to produce\", \"object\": \"complex forms\"},\n {\"subject\": \"lines\", \"predicate\": \"are building blocks to produce\", \"object\": \"complex forms\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"are building blocks to produce\", \"object\": \"complex forms\"},\n {\"subject\": \"Curves\", \"predicate\": \"are building blocks to produce\", \"object\": \"geometries\"},\n {\"subject\": \"lines\", \"predicate\": \"are building blocks to produce\", \"object\": \"geometries\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"are building blocks to produce\", \"object\": \"geometries\"}\n ]\n}\n```",
|
||||
"error": null
|
||||
},
|
||||
"condition_b": {
|
||||
"local_latency_s": 2.73,
|
||||
"local_candidates": [
|
||||
"BASICS",
|
||||
"Curves",
|
||||
"lines",
|
||||
"Poly lines",
|
||||
"Planar",
|
||||
"non-planar",
|
||||
"open",
|
||||
"closed"
|
||||
],
|
||||
"local_raw": "{\n \"candidates\": [\"BASICS\", \"Curves\", \"lines\", \"Poly lines\", \"Planar\", \"non-planar\", \"open\", \"closed\"]\n}",
|
||||
"api_input_tokens": 278,
|
||||
"api_output_tokens": 735,
|
||||
"api_latency_s": 4.31,
|
||||
"entity_count": 11,
|
||||
"edge_count": 21,
|
||||
"stop_reason": "end_turn",
|
||||
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"BASICS\", \"type\": \"concept\"},\n {\"name\": \"2D geometry\", \"type\": \"geometric_domain\"},\n {\"name\": \"Curves\", \"type\": \"geometric_element\"},\n {\"name\": \"lines\", \"type\": \"geometric_element\"},\n {\"name\": \"Poly lines\", \"type\": \"geometric_element\"},\n {\"name\": \"planar\", \"type\": \"property\"},\n {\"name\": \"non-planar\", \"type\": \"property\"},\n {\"name\": \"open\", \"type\": \"property\"},\n {\"name\": \"closed\", \"type\": \"property\"},\n {\"name\": \"complex forms\", \"type\": \"geometric_construct\"},\n {\"name\": \"geometries\", \"type\": \"geometric_construct\"}\n ],\n \"edges\": [\n {\"subject\": \"2D geometry\", \"predicate\": \"is comprised of\", \"object\": \"Curves\"},\n {\"subject\": \"2D geometry\", \"predicate\": \"is comprised of\", \"object\": \"lines\"},\n {\"subject\": \"2D geometry\", \"predicate\": \"is comprised of\", \"object\": \"Poly lines\"},\n {\"subject\": \"Curves\", \"predicate\": \"can be\", \"object\": \"planar\"},\n {\"subject\": \"Curves\", \"predicate\": \"can be\", \"object\": \"non-planar\"},\n {\"subject\": \"Curves\", \"predicate\": \"can be\", \"object\": \"open\"},\n {\"subject\": \"Curves\", \"predicate\": \"can be\", \"object\": \"closed\"},\n {\"subject\": \"lines\", \"predicate\": \"can be\", \"object\": \"planar\"},\n {\"subject\": \"lines\", \"predicate\": \"can be\", \"object\": \"non-planar\"},\n {\"subject\": \"lines\", \"predicate\": \"can be\", \"object\": \"open\"},\n {\"subject\": \"lines\", \"predicate\": \"can be\", \"object\": \"closed\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"can be\", \"object\": \"planar\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"can be\", \"object\": \"non-planar\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"can be\", \"object\": \"open\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"can be\", \"object\": \"closed\"},\n {\"subject\": \"Curves\", \"predicate\": 
\"are building blocks to produce\", \"object\": \"complex forms\"},\n {\"subject\": \"lines\", \"predicate\": \"are building blocks to produce\", \"object\": \"complex forms\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"are building blocks to produce\", \"object\": \"complex forms\"},\n {\"subject\": \"Curves\", \"predicate\": \"are building blocks to produce\", \"object\": \"geometries\"},\n {\"subject\": \"lines\", \"predicate\": \"are building blocks to produce\", \"object\": \"geometries\"},\n {\"subject\": \"Poly lines\", \"predicate\": \"are building blocks to produce\", \"object\": \"geometries\"}\n ]\n}\n```",
|
||||
"error": null
|
||||
}
|
||||
},
|
||||
{
|
||||
"source": "03_2D Transformation and Deformation.docx",
|
||||
"size_bucket": "small",
|
||||
"doc_chars_original": 418,
|
||||
"doc_chars_sent": 418,
|
||||
"truncated": false,
|
||||
"condition_a": {
|
||||
"input_tokens": 245,
|
||||
"output_tokens": 471,
|
||||
"latency_s": 2.41,
|
||||
"entity_count": 10,
|
||||
"edge_count": 11,
|
||||
"stop_reason": "end_turn",
|
||||
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Manipulation of geometry\", \"type\": \"Concept\"},\n {\"name\": \"Transformations\", \"type\": \"Category\"},\n {\"name\": \"Deformations\", \"type\": \"Category\"},\n {\"name\": \"proportions\", \"type\": \"Property\"},\n {\"name\": \"relative geometries\", \"type\": \"Property\"},\n {\"name\": \"curves\", \"type\": \"Object\"},\n {\"name\": \"objects\", \"type\": \"Object\"},\n {\"name\": \"space\", \"type\": \"Context\"},\n {\"name\": \"angles\", \"type\": \"Property\"},\n {\"name\": \"curvature\", \"type\": \"Property\"}\n ],\n \"edges\": [\n {\"subject\": \"Manipulation of geometry\", \"predicate\": \"can be broken down into\", \"object\": \"Transformations\"},\n {\"subject\": \"Manipulation of geometry\", \"predicate\": \"can be broken down into\", \"object\": \"Deformations\"},\n {\"subject\": \"Transformations\", \"predicate\": \"do not change\", \"object\": \"proportions\"},\n {\"subject\": \"Transformations\", \"predicate\": \"do not change\", \"object\": \"relative geometries\"},\n {\"subject\": \"Transformations\", \"predicate\": \"change\", \"object\": \"relationship to space\"},\n {\"subject\": \"Deformations\", \"predicate\": \"are changes made to\", \"object\": \"existing geometry\"},\n {\"subject\": \"Deformations\", \"predicate\": \"affect\", \"object\": \"proportions\"},\n {\"subject\": \"Deformations\", \"predicate\": \"affect\", \"object\": \"angles\"},\n {\"subject\": \"Deformations\", \"predicate\": \"affect\", \"object\": \"curvature\"},\n {\"subject\": \"proportions\", \"predicate\": \"belong to\", \"object\": \"curves\"},\n {\"subject\": \"proportions\", \"predicate\": \"belong to\", \"object\": \"objects\"}\n ]\n}\n```",
"error": null
},
"condition_b": {
"local_latency_s": 4.36,
"local_candidates": [
"Manipulation",
"Geometry",
"Transformations",
"Deformations",
"curves",
"objects",
"space"
],
"local_raw": "{\n \"candidates\": [\"Manipulation\", \"Geometry\", \"Transformations\", \"Deformations\", \"curves\", \"objects\", \"space\"]\n}",
"api_input_tokens": 311,
"api_output_tokens": 552,
"api_latency_s": 2.81,
"entity_count": 11,
"edge_count": 13,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Manipulation\", \"type\": \"Concept\"},\n {\"name\": \"Geometry\", \"type\": \"Concept\"},\n {\"name\": \"Transformations\", \"type\": \"Category\"},\n {\"name\": \"Deformations\", \"type\": \"Category\"},\n {\"name\": \"Curves\", \"type\": \"Geometric Element\"},\n {\"name\": \"Objects\", \"type\": \"Geometric Element\"},\n {\"name\": \"Space\", \"type\": \"Concept\"},\n {\"name\": \"Proportions\", \"type\": \"Property\"},\n {\"name\": \"Relative geometries\", \"type\": \"Property\"},\n {\"name\": \"Angles\", \"type\": \"Property\"},\n {\"name\": \"Curvature\", \"type\": \"Property\"}\n ],\n \"edges\": [\n {\"subject\": \"Manipulation\", \"predicate\": \"can be broken down into\", \"object\": \"Transformations\"},\n {\"subject\": \"Manipulation\", \"predicate\": \"can be broken down into\", \"object\": \"Deformations\"},\n {\"subject\": \"Transformations\", \"predicate\": \"is a category of\", \"object\": \"Geometry\"},\n {\"subject\": \"Deformations\", \"predicate\": \"is a category of\", \"object\": \"Geometry\"},\n {\"subject\": \"Transformations\", \"predicate\": \"do not change\", \"object\": \"Proportions\"},\n {\"subject\": \"Transformations\", \"predicate\": \"do not change\", \"object\": \"Relative geometries\"},\n {\"subject\": \"Transformations\", \"predicate\": \"apply to\", \"object\": \"Curves\"},\n {\"subject\": \"Transformations\", \"predicate\": \"apply to\", \"object\": \"Objects\"},\n {\"subject\": \"Transformations\", \"predicate\": \"change relationship to\", \"object\": \"Space\"},\n {\"subject\": \"Deformations\", \"predicate\": \"change\", \"object\": \"Proportions\"},\n {\"subject\": \"Deformations\", \"predicate\": \"change\", \"object\": \"Angles\"},\n {\"subject\": \"Deformations\", \"predicate\": \"change\", \"object\": \"Curvature\"},\n {\"subject\": \"Deformations\", \"predicate\": \"apply to\", \"object\": \"Geometry\"}\n ]\n}\n```",
"error": null
}
},
{
"source": "03_2D Transformation, Deformation, and Editing-AARON.docx",
"size_bucket": "small",
"doc_chars_original": 541,
"doc_chars_sent": 541,
"truncated": false,
"condition_a": {
"input_tokens": 271,
"output_tokens": 587,
"latency_s": 2.7,
"entity_count": 14,
"edge_count": 13,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Manipulation of geometry\", \"type\": \"Concept\"},\n {\"name\": \"Transformations\", \"type\": \"Category\"},\n {\"name\": \"Deformations\", \"type\": \"Category\"},\n {\"name\": \"proportions\", \"type\": \"Property\"},\n {\"name\": \"relative geometries\", \"type\": \"Property\"},\n {\"name\": \"curves\", \"type\": \"Geometric Element\"},\n {\"name\": \"objects\", \"type\": \"Geometric Element\"},\n {\"name\": \"space\", \"type\": \"Concept\"},\n {\"name\": \"angles\", \"type\": \"Property\"},\n {\"name\": \"curvature\", \"type\": \"Property\"},\n {\"name\": \"Editing Commands\", \"type\": \"Concept\"},\n {\"name\": \"geometry\", \"type\": \"Concept\"},\n {\"name\": \"sections\", \"type\": \"Geometric Element\"},\n {\"name\": \"form\", \"type\": \"Geometric Element\"}\n ],\n \"edges\": [\n {\"subject\": \"Manipulation of geometry\", \"predicate\": \"can be broken down into\", \"object\": \"Transformations\"},\n {\"subject\": \"Manipulation of geometry\", \"predicate\": \"can be broken down into\", \"object\": \"Deformations\"},\n {\"subject\": \"Transformations\", \"predicate\": \"do not change\", \"object\": \"proportions\"},\n {\"subject\": \"Transformations\", \"predicate\": \"do not change\", \"object\": \"relative geometries\"},\n {\"subject\": \"Transformations\", \"predicate\": \"change\", \"object\": \"space\"},\n {\"subject\": \"Deformations\", \"predicate\": \"are changes made to\", \"object\": \"geometry\"},\n {\"subject\": \"Deformations\", \"predicate\": \"affect\", \"object\": \"proportions\"},\n {\"subject\": \"Deformations\", \"predicate\": \"affect\", \"object\": \"angles\"},\n {\"subject\": \"Deformations\", \"predicate\": \"affect\", \"object\": \"curvature\"},\n {\"subject\": \"Editing Commands\", \"predicate\": \"can edit\", \"object\": \"geometry\"},\n {\"subject\": \"Editing Commands\", \"predicate\": \"involves adding\", \"object\": \"sections\"},\n {\"subject\": \"Editing Commands\", \"predicate\": \"involves combining\", \"object\": \"sections\"},\n {\"subject\": \"Editing Commands\", \"predicate\": \"involves removing\", \"object\": \"sections\"}\n ]\n}\n```",
"error": null
},
"condition_b": {
"local_latency_s": 6.06,
"local_candidates": [
"Manipulation",
"Geometry",
"Transformations",
"Deformations",
"Curves",
"Objects",
"Space",
"Editing Commands",
"Adding",
"Combining",
"Removing"
],
"local_raw": "{\n \"candidates\": [\"Manipulation\", \"Geometry\", \"Transformations\", \"Deformations\", \"Curves\", \"Objects\", \"Space\", \"Editing Commands\", \"Adding\", \"Combining\", \"Removing\"]\n}",
"api_input_tokens": 354,
"api_output_tokens": 620,
"api_latency_s": 3.03,
"entity_count": 15,
"edge_count": 13,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Manipulation\", \"type\": \"Concept\"},\n {\"name\": \"Geometry\", \"type\": \"Concept\"},\n {\"name\": \"Transformations\", \"type\": \"Category\"},\n {\"name\": \"Deformations\", \"type\": \"Category\"},\n {\"name\": \"Curves\", \"type\": \"Geometric Element\"},\n {\"name\": \"Objects\", \"type\": \"Geometric Element\"},\n {\"name\": \"Space\", \"type\": \"Concept\"},\n {\"name\": \"Proportions\", \"type\": \"Property\"},\n {\"name\": \"Relative Geometries\", \"type\": \"Property\"},\n {\"name\": \"Angles\", \"type\": \"Property\"},\n {\"name\": \"Curvature\", \"type\": \"Property\"},\n {\"name\": \"Editing Commands\", \"type\": \"Operation\"},\n {\"name\": \"Adding\", \"type\": \"Editing Operation\"},\n {\"name\": \"Combining\", \"type\": \"Editing Operation\"},\n {\"name\": \"Removing\", \"type\": \"Editing Operation\"}\n ],\n \"edges\": [\n {\"subject\": \"Manipulation\", \"predicate\": \"can be broken down into\", \"object\": \"Transformations\"},\n {\"subject\": \"Manipulation\", \"predicate\": \"can be broken down into\", \"object\": \"Deformations\"},\n {\"subject\": \"Transformations\", \"predicate\": \"do not change\", \"object\": \"Proportions\"},\n {\"subject\": \"Transformations\", \"predicate\": \"do not change\", \"object\": \"Relative Geometries\"},\n {\"subject\": \"Transformations\", \"predicate\": \"change\", \"object\": \"Space\"},\n {\"subject\": \"Deformations\", \"predicate\": \"are changes to\", \"object\": \"Geometry\"},\n {\"subject\": \"Deformations\", \"predicate\": \"affect\", \"object\": \"Proportions\"},\n {\"subject\": \"Deformations\", \"predicate\": \"affect\", \"object\": \"Angles\"},\n {\"subject\": \"Deformations\", \"predicate\": \"affect\", \"object\": \"Curvature\"},\n {\"subject\": \"Editing Commands\", \"predicate\": \"can edit\", \"object\": \"Geometry\"},\n {\"subject\": \"Adding\", \"predicate\": \"is a type of\", \"object\": \"Editing Commands\"},\n {\"subject\": \"Combining\", \"predicate\": \"is a type of\", \"object\": \"Editing Commands\"},\n {\"subject\": \"Removing\", \"predicate\": \"is a type of\", \"object\": \"Editing Commands\"}\n ]\n}\n```",
"error": null
}
},
{
"source": "03_Editing Geometry.docx",
"size_bucket": "small",
"doc_chars_original": 171,
"doc_chars_sent": 171,
"truncated": false,
"condition_a": {
"input_tokens": 195,
"output_tokens": 320,
"latency_s": 5.21,
"entity_count": 7,
"edge_count": 8,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Editing Commands\", \"type\": \"Feature\"},\n {\"name\": \"geometry\", \"type\": \"Object\"},\n {\"name\": \"sections\", \"type\": \"Component\"},\n {\"name\": \"form\", \"type\": \"Structure\"},\n {\"name\": \"adding\", \"type\": \"Operation\"},\n {\"name\": \"combining\", \"type\": \"Operation\"},\n {\"name\": \"removing\", \"type\": \"Operation\"}\n ],\n \"edges\": [\n {\"subject\": \"Editing Commands\", \"predicate\": \"can be used to edit\", \"object\": \"geometry\"},\n {\"subject\": \"geometry\", \"predicate\": \"can be further edited by\", \"object\": \"adding\"},\n {\"subject\": \"geometry\", \"predicate\": \"can be further edited by\", \"object\": \"combining\"},\n {\"subject\": \"geometry\", \"predicate\": \"can be further edited by\", \"object\": \"removing\"},\n {\"subject\": \"adding\", \"predicate\": \"modifies\", \"object\": \"sections\"},\n {\"subject\": \"combining\", \"predicate\": \"modifies\", \"object\": \"sections\"},\n {\"subject\": \"removing\", \"predicate\": \"modifies\", \"object\": \"sections\"},\n {\"subject\": \"sections\", \"predicate\": \"are part of\", \"object\": \"form\"}\n ]\n}\n```",
"error": null
},
"condition_b": {
"local_latency_s": 3.16,
"local_candidates": [
"Editing",
"Commands",
"Geometry",
"Adding",
"Combining",
"Removing",
"Sections"
],
"local_raw": "{\n \"candidates\": [\"Editing\", \"Commands\", \"Geometry\", \"Adding\", \"Combining\", \"Removing\", \"Sections\"]\n}",
"api_input_tokens": 259,
"api_output_tokens": 289,
"api_latency_s": 1.69,
"entity_count": 6,
"edge_count": 7,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Editing Commands\", \"type\": \"Topic\"},\n {\"name\": \"Geometry\", \"type\": \"Concept\"},\n {\"name\": \"Adding\", \"type\": \"Operation\"},\n {\"name\": \"Combining\", \"type\": \"Operation\"},\n {\"name\": \"Removing\", \"type\": \"Operation\"},\n {\"name\": \"Sections\", \"type\": \"Component\"}\n ],\n \"edges\": [\n {\"subject\": \"Geometry\", \"predicate\": \"can be edited through\", \"object\": \"Editing Commands\"},\n {\"subject\": \"Editing Commands\", \"predicate\": \"include\", \"object\": \"Adding\"},\n {\"subject\": \"Editing Commands\", \"predicate\": \"include\", \"object\": \"Combining\"},\n {\"subject\": \"Editing Commands\", \"predicate\": \"include\", \"object\": \"Removing\"},\n {\"subject\": \"Adding\", \"predicate\": \"operates on\", \"object\": \"Sections\"},\n {\"subject\": \"Combining\", \"predicate\": \"operates on\", \"object\": \"Sections\"},\n {\"subject\": \"Removing\", \"predicate\": \"operates on\", \"object\": \"Sections\"}\n ]\n}\n```",
"error": null
}
},
{
"source": "04_Annotations.docx",
"size_bucket": "small",
"doc_chars_original": 737,
"doc_chars_sent": 737,
"truncated": false,
"condition_a": {
"input_tokens": 318,
"output_tokens": 692,
"latency_s": 3.44,
"entity_count": 17,
"edge_count": 17,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Annotations\", \"type\": \"Feature\"},\n {\"name\": \"drawing\", \"type\": \"Document\"},\n {\"name\": \"manufacturing\", \"type\": \"Process\"},\n {\"name\": \"3D viewing program\", \"type\": \"Software\"},\n {\"name\": \"Annotation Settings\", \"type\": \"Configuration\"},\n {\"name\": \"scale\", \"type\": \"Property\"},\n {\"name\": \"location\", \"type\": \"Property\"},\n {\"name\": \"ratio\", \"type\": \"Property\"},\n {\"name\": \"dimension\", \"type\": \"Element\"},\n {\"name\": \"object\", \"type\": \"Entity\"},\n {\"name\": \"Options\", \"type\": \"Menu\"},\n {\"name\": \"Document properties\", \"type\": \"Menu\"},\n {\"name\": \"Annotation Styles\", \"type\": \"Menu\"},\n {\"name\": \"icon\", \"type\": \"UI Element\"},\n {\"name\": \"pop up window\", \"type\": \"UI Element\"},\n {\"name\": \"Annotation style\", \"type\": \"Configuration\"},\n {\"name\": \"APPLY\", \"type\": \"Button\"}\n ],\n \"edges\": [\n {\"subject\": \"Annotations\", \"predicate\": \"communicate\", \"object\": \"technical elements\"},\n {\"subject\": \"Annotations\", \"predicate\": \"used for\", \"object\": \"drawing\"},\n {\"subject\": \"Annotations\", \"predicate\": \"help with\", \"object\": \"manufacturing\"},\n {\"subject\": \"Annotations\", \"predicate\": \"alternative to\", \"object\": \"3D viewing program\"},\n {\"subject\": \"Annotation Settings\", \"predicate\": \"requires proper setup of\", \"object\": \"scale\"},\n {\"subject\": \"Annotation Settings\", \"predicate\": \"requires proper setup of\", \"object\": \"location\"},\n {\"subject\": \"Annotation Settings\", \"predicate\": \"requires proper setup of\", \"object\": \"ratio\"},\n {\"subject\": \"scale\", \"predicate\": \"applies to\", \"object\": \"dimension\"},\n {\"subject\": \"location\", \"predicate\": \"applies to\", \"object\": \"dimension\"},\n {\"subject\": \"ratio\", \"predicate\": \"applies to\", \"object\": \"dimension\"},\n {\"subject\": \"dimension\", \"predicate\": \"should be proportional to\", \"object\": \"object\"},\n {\"subject\": \"Options\", \"predicate\": \"contains\", \"object\": \"Document properties\"},\n {\"subject\": \"Document properties\", \"predicate\": \"contains\", \"object\": \"Annotation Styles\"},\n {\"subject\": \"icon\", \"predicate\": \"opens\", \"object\": \"pop up window\"},\n {\"subject\": \"pop up window\", \"predicate\": \"displays\", \"object\": \"Annotation style\"},\n {\"subject\": \"Annotation style\", \"predicate\": \"can be edited\", \"object\": \"Annotation Settings\"},\n {\"subject\": \"APPLY\", \"predicate\": \"confirms\", \"object\": \"changes\"}\n ]\n}\n```",
"error": null
},
"condition_b": {
"local_latency_s": 7.22,
"local_candidates": [
"Annotations",
"drawing",
"object",
"manufacturing",
"3D viewing program",
"Options",
"Document properties",
"Annotation Styles",
"icon",
"pop up window"
],
"local_raw": "{\n \"candidates\": [\"Annotations\", \"drawing\", \"object\", \"manufacturing\", \"3D viewing program\", \"Options\", \"Document properties\", \"Annotation Styles\", \"icon\", \"pop up window\"]\n}",
"api_input_tokens": 397,
"api_output_tokens": 619,
"api_latency_s": 3.23,
"entity_count": 16,
"edge_count": 14,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Annotations\", \"type\": \"Feature\"},\n {\"name\": \"drawing\", \"type\": \"Document\"},\n {\"name\": \"object\", \"type\": \"Item\"},\n {\"name\": \"manufacturing\", \"type\": \"Process\"},\n {\"name\": \"3D viewing program\", \"type\": \"Software\"},\n {\"name\": \"Annotation Settings\", \"type\": \"Configuration\"},\n {\"name\": \"scale\", \"type\": \"Property\"},\n {\"name\": \"location\", \"type\": \"Property\"},\n {\"name\": \"ratio\", \"type\": \"Property\"},\n {\"name\": \"dimension\", \"type\": \"Property\"},\n {\"name\": \"Options\", \"type\": \"Menu\"},\n {\"name\": \"Document properties\", \"type\": \"Menu\"},\n {\"name\": \"Annotation Styles\", \"type\": \"Menu\"},\n {\"name\": \"icon\", \"type\": \"UI Element\"},\n {\"name\": \"pop up window\", \"type\": \"UI Element\"},\n {\"name\": \"Annotation style\", \"type\": \"Configuration\"}\n ],\n \"edges\": [\n {\"subject\": \"Annotations\", \"predicate\": \"communicate\", \"object\": \"technical elements of a drawing\"},\n {\"subject\": \"Annotations\", \"predicate\": \"help communicate to\", \"object\": \"manufacturing\"},\n {\"subject\": \"Annotations\", \"predicate\": \"help communicate to\", \"object\": \"someone to review\"},\n {\"subject\": \"someone to review\", \"predicate\": \"may lack access to\", \"object\": \"3D viewing program\"},\n {\"subject\": \"Annotation Settings\", \"predicate\": \"requires proper setup of\", \"object\": \"scale\"},\n {\"subject\": \"Annotation Settings\", \"predicate\": \"requires proper setup of\", \"object\": \"location\"},\n {\"subject\": \"Annotation Settings\", \"predicate\": \"requires proper setup of\", \"object\": \"ratio\"},\n {\"subject\": \"Annotation Settings\", \"predicate\": \"requires proper setup of\", \"object\": \"dimension\"},\n {\"subject\": \"Options\", \"predicate\": \"contains\", \"object\": \"Document properties\"},\n {\"subject\": \"Document properties\", \"predicate\": \"contains\", \"object\": \"Annotation Styles\"},\n {\"subject\": \"icon\", \"predicate\": \"provides access to\", \"object\": \"pop up window\"},\n {\"subject\": \"pop up window\", \"predicate\": \"displays\", \"object\": \"Annotation style\"},\n {\"subject\": \"user\", \"predicate\": \"clicks\", \"object\": \"APPLY\"},\n {\"subject\": \"APPLY\", \"predicate\": \"confirms\", \"object\": \"changes\"}\n ]\n}\n```",
"error": null
}
},
{
"source": "00_Syllabus.docx",
"size_bucket": "medium",
"doc_chars_original": 2273,
"doc_chars_sent": 2273,
"truncated": false,
"condition_a": {
"input_tokens": 607,
"output_tokens": 1159,
"latency_s": 5.11,
"entity_count": 30,
"edge_count": 26,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"3D Computer Aided Design drawing\", \"type\": \"subject\"},\n {\"name\": \"rapid manufacturing\", \"type\": \"subject\"},\n {\"name\": \"Rhino3D\", \"type\": \"software\"},\n {\"name\": \"NURBS geometry\", \"type\": \"concept\"},\n {\"name\": \"3D printing\", \"type\": \"technology\"},\n {\"name\": \"CAD software\", \"type\": \"software\"},\n {\"name\": \"digital fabrication software\", \"type\": \"software\"},\n {\"name\": \"digital fabrication hardware\", \"type\": \"hardware\"},\n {\"name\": \"spatial awareness\", \"type\": \"skill\"},\n {\"name\": \"scales\", \"type\": \"concept\"},\n {\"name\": \"measuring devices\", \"type\": \"tool\"},\n {\"name\": \"physical dimensions\", \"type\": \"concept\"},\n {\"name\": \"virtual space\", \"type\": \"concept\"},\n {\"name\": \"three dimensional form\", \"type\": \"concept\"},\n {\"name\": \"digital modeling\", \"type\": \"skill\"},\n {\"name\": \"three dimensional design principals\", \"type\": \"concept\"},\n {\"name\": \"Desktop FDM printing\", \"type\": \"technology\"},\n {\"name\": \"printing process\", \"type\": \"process\"},\n {\"name\": \"tolerances\", \"type\": \"concept\"},\n {\"name\": \"critical thinking\", \"type\": \"skill\"},\n {\"name\": \"problem solving skills\", \"type\": \"skill\"},\n {\"name\": \"research methods\", \"type\": \"skill\"},\n {\"name\": \"Rhino Level 1 Training Guide\", \"type\": \"material\"},\n {\"name\": \"assignments\", \"type\": \"assessment\"},\n {\"name\": \"grade\", \"type\": \"assessment\"},\n {\"name\": \"A grade\", \"type\": \"grade\"},\n {\"name\": \"B grade\", \"type\": \"grade\"},\n {\"name\": \"C grade\", \"type\": \"grade\"},\n {\"name\": \"D grade\", \"type\": \"grade\"},\n {\"name\": \"F grade\", \"type\": \"grade\"}\n ],\n \"edges\": [\n {\"subject\": \"course\", \"predicate\": \"introduces\", \"object\": \"3D Computer Aided Design drawing\"},\n {\"subject\": \"course\", \"predicate\": \"introduces\", \"object\": \"rapid manufacturing\"},\n {\"subject\": \"course\", \"predicate\": \"uses\", \"object\": \"Rhino3D\"},\n {\"subject\": \"students\", \"predicate\": \"become acquainted with\", \"object\": \"virtual space\"},\n {\"subject\": \"students\", \"predicate\": \"become acquainted with\", \"object\": \"NURBS geometry\"},\n {\"subject\": \"students\", \"predicate\": \"gain hands on technical skills with\", \"object\": \"digital fabrication software\"},\n {\"subject\": \"students\", \"predicate\": \"gain hands on technical skills with\", \"object\": \"digital fabrication hardware\"},\n {\"subject\": \"students\", \"predicate\": \"demonstrate\", \"object\": \"spatial awareness\"},\n {\"subject\": \"students\", \"predicate\": \"use\", \"object\": \"scales\"},\n {\"subject\": \"students\", \"predicate\": \"use\", \"object\": \"measuring devices\"},\n {\"subject\": \"students\", \"predicate\": \"translate\", \"object\": \"physical dimensions into virtual space\"},\n {\"subject\": \"students\", \"predicate\": \"gain working knowledge of\", \"object\": \"three dimensional form\"},\n {\"subject\": \"students\", \"predicate\": \"apply\", \"object\": \"three dimensional design principals\"},\n {\"subject\": \"students\", \"predicate\": \"gain understanding of designing for\", \"object\": \"Desktop FDM printing\"},\n {\"subject\": \"students\", \"predicate\": \"understand\", \"object\": \"printing process\"},\n {\"subject\": \"students\", \"predicate\": \"apply\", \"object\": \"tolerances\"},\n {\"subject\": \"students\", \"predicate\": \"use\", \"object\": \"critical thinking\"},\n {\"subject\": \"students\", \"predicate\": \"use\", \"object\": \"problem solving skills\"},\n {\"subject\": \"students\", \"predicate\": \"use\", \"object\": \"research methods\"},\n {\"subject\": \"course\", \"predicate\": \"requires\", \"object\": \"Rhino Level 1 Training Guide\"},\n {\"subject\": \"late assignments\", \"predicate\": \"result in grade reduction of\", \"object\": \"10% per day\"},\n {\"subject\": \"A grade\", \"predicate\": \"represents\", \"object\": \"Outstanding, excellent work\"},\n {\"subject\": \"B grade\", \"predicate\": \"represents\", \"object\": \"Good, competent work\"},\n {\"subject\": \"C grade\", \"predicate\": \"represents\", \"object\": \"Adequate, reasonably satisfactory work\"},\n {\"subject\": \"D grade\", \"predicate\": \"represents\", \"object\": \"Minimally acceptable work\"},\n {\"subject\": \"F grade\",",
"error": null
},
"condition_b": {
"local_latency_s": 16.34,
"local_candidates": [
"Rhino3D",
"CAD software",
"NURBS geometry",
"3D printing",
"Digital fabrication software",
"Digital fabrication hardware",
"Desktop FDM printing",
"Rhino Level 1 Training Guide"
],
"local_raw": "{\n \"candidates\": [\"Rhino3D\", \"CAD software\", \"NURBS geometry\", \"3D printing\", \"Digital fabrication software\", \"Digital fabrication hardware\", \"Desktop FDM printing\", \"Rhino Level 1 Training Guide\"]\n}",
"api_input_tokens": 699,
"api_output_tokens": 858,
"api_latency_s": 4.39,
"entity_count": 18,
"edge_count": 19,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Rhino3D\", \"type\": \"CAD software\"},\n {\"name\": \"CAD software\", \"type\": \"software category\"},\n {\"name\": \"NURBS geometry\", \"type\": \"geometric modeling technique\"},\n {\"name\": \"3D printing\", \"type\": \"manufacturing technology\"},\n {\"name\": \"digital fabrication software\", \"type\": \"software category\"},\n {\"name\": \"digital fabrication hardware\", \"type\": \"hardware category\"},\n {\"name\": \"Desktop FDM printing\", \"type\": \"3D printing technology\"},\n {\"name\": \"Rhino Level 1 Training Guide\", \"type\": \"educational resource\"},\n {\"name\": \"3D Computer Aided Design\", \"type\": \"design discipline\"},\n {\"name\": \"rapid manufacturing\", \"type\": \"manufacturing approach\"},\n {\"name\": \"virtual space\", \"type\": \"digital environment\"},\n {\"name\": \"tangible objects\", \"type\": \"physical output\"},\n {\"name\": \"spatial awareness\", \"type\": \"skill\"},\n {\"name\": \"three dimensional form\", \"type\": \"design concept\"},\n {\"name\": \"digital modeling\", \"type\": \"design process\"},\n {\"name\": \"three dimensional design principals\", \"type\": \"design concept\"},\n {\"name\": \"file preparation\", \"type\": \"technical process\"},\n {\"name\": \"tolerances\", \"type\": \"technical specification\"}\n ],\n \"edges\": [\n {\"subject\": \"Rhino3D\", \"predicate\": \"is a type of\", \"object\": \"CAD software\"},\n {\"subject\": \"Rhino3D\", \"predicate\": \"supports\", \"object\": \"NURBS geometry\"},\n {\"subject\": \"Rhino3D\", \"predicate\": \"enables\", \"object\": \"3D Computer Aided Design\"},\n {\"subject\": \"3D Computer Aided Design\", \"predicate\": \"uses\", \"object\": \"CAD software\"},\n {\"subject\": \"3D Computer Aided Design\", \"predicate\": \"involves\", \"object\": \"NURBS geometry\"},\n {\"subject\": \"3D Computer Aided Design\", \"predicate\": \"enables\", \"object\": \"rapid manufacturing\"},\n {\"subject\": \"rapid manufacturing\", \"predicate\": \"uses\", \"object\": \"3D printing\"},\n {\"subject\": \"3D printing\", \"predicate\": \"produces\", \"object\": \"tangible objects\"},\n {\"subject\": \"digital fabrication software\", \"predicate\": \"includes\", \"object\": \"Rhino3D\"},\n {\"subject\": \"Desktop FDM printing\", \"predicate\": \"is a type of\", \"object\": \"3D printing\"},\n {\"subject\": \"course\", \"predicate\": \"teaches\", \"object\": \"Rhino3D\"},\n {\"subject\": \"course\", \"predicate\": \"teaches\", \"object\": \"digital fabrication software\"},\n {\"subject\": \"course\", \"predicate\": \"teaches\", \"object\": \"digital fabrication hardware\"},\n {\"subject\": \"course\", \"predicate\": \"teaches\", \"object\": \"Desktop FDM printing\"},\n {\"subject\": \"Rhino Level 1 Training Guide\", \"predicate\": \"provides training for\", \"object\": \"Rhino3D\"},\n {\"subject\": \"students\", \"predicate\": \"learn\", \"object\": \"three dimensional design principals\"},\n {\"subject\": \"students\", \"predicate\": \"develop\", \"object\": \"spatial awareness\"},\n {\"subject\": \"file preparation\", \"predicate\": \"requires understanding of\", \"object\": \"tolerances\"},\n {\"subject\": \"Desktop FDM printing\", \"predicate\": \"has\", \"object\": \"tolerances\"}\n ]\n}\n```",
"error": null
}
},
{
"source": "01_ALL_Overview of AM and 3DP_v3.pptx",
"size_bucket": "large",
"doc_chars_original": 5833,
"doc_chars_sent": 5833,
"truncated": false,
"condition_a": {
"input_tokens": 1579,
"output_tokens": 1602,
"latency_s": 7.65,
"entity_count": 34,
"edge_count": 34,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Additive Manufacturing\", \"type\": \"Manufacturing Process\"},\n {\"name\": \"3D Printing\", \"type\": \"Manufacturing Process\"},\n {\"name\": \"Cutting\", \"type\": \"Manufacturing Process\"},\n {\"name\": \"Subtractive Manufacturing\", \"type\": \"Manufacturing Process\"},\n {\"name\": \"Forming\", \"type\": \"Manufacturing Process\"},\n {\"name\": \"ASTM International Committee F42\", \"type\": \"Organization\"},\n {\"name\": \"Wohler's Report 2014\", \"type\": \"Document\"},\n {\"name\": \"Laser Cutting\", \"type\": \"Manufacturing Technique\"},\n {\"name\": \"Carving\", \"type\": \"Manufacturing Technique\"},\n {\"name\": \"Drilling\", \"type\": \"Manufacturing Technique\"},\n {\"name\": \"Milling\", \"type\": \"Manufacturing Technique\"},\n {\"name\": \"Chiseling\", \"type\": \"Manufacturing Technique\"},\n {\"name\": \"Glass Blowing\", \"type\": \"Manufacturing Technique\"},\n {\"name\": \"Vacuum Forming\", \"type\": \"Manufacturing Technique\"},\n {\"name\": \"Hydroforming\", \"type\": \"Manufacturing Technique\"},\n {\"name\": \"PolyJet\", \"type\": \"3D Printing Technology\"},\n {\"name\": \"FDM\", \"type\": \"3D Printing Technology\"},\n {\"name\": \"Neolithic Revolution\", \"type\": \"Historical Period\"},\n {\"name\": \"First Industrial Revolution\", \"type\": \"Historical Period\"},\n {\"name\": \"Second Industrial Revolution\", \"type\": \"Historical Period\"},\n {\"name\": \"Digital Revolution\", \"type\": \"Historical Period\"},\n {\"name\": \"Third Industrial Revolution\", \"type\": \"Historical Period\"},\n {\"name\": \"Decentralized Manufacturing\", \"type\": \"Business Model\"},\n {\"name\": \"Economies of Scale\", \"type\": \"Economic Concept\"},\n {\"name\": \"Economies of Scope\", \"type\": \"Economic Concept\"},\n {\"name\": \"McKinsey & Company\", \"type\": \"Organization\"},\n {\"name\": \"Module 1\", \"type\": \"Educational Module\"},\n {\"name\": \"Freedom\", \"type\": \"Advantage\"},\n {\"name\": \"Closed System\", \"type\": \"Advantage\"},\n {\"name\": \"Quick Production\", \"type\": \"Advantage\"},\n {\"name\": \"Multiple Materials One Print\", \"type\": \"Advantage\"},\n {\"name\": \"Low-cost Manufacturing\", \"type\": \"Advantage\"},\n {\"name\": \"Real Thermoplastics\", \"type\": \"Material\"},\n {\"name\": \"Factory of Tomorrow\", \"type\": \"Concept\"}\n ],\n \"edges\": [\n {\"subject\": \"3D Printing\", \"predicate\": \"is a type of\", \"object\": \"Additive Manufacturing\"},\n {\"subject\": \"Additive Manufacturing\", \"predicate\": \"is defined by\", \"object\": \"ASTM International Committee F42\"},\n {\"subject\": \"Additive Manufacturing\", \"predicate\": \"is documented in\", \"object\": \"Wohler's Report 2014\"},\n {\"subject\": \"Cutting\", \"predicate\": \"is a\", \"object\": \"Manufacturing Process\"},\n {\"subject\": \"Subtractive Manufacturing\", \"predicate\": \"is a\", \"object\": \"Manufacturing Process\"},\n {\"subject\": \"Forming\", \"predicate\": \"is a\", \"object\": \"Manufacturing Process\"},\n {\"subject\": \"Laser Cutting\", \"predicate\": \"is a type of\", \"object\": \"Cutting\"},\n {\"subject\": \"Carving\", \"predicate\": \"is a type of\", \"object\": \"Subtractive Manufacturing\"},\n {\"subject\": \"Drilling\", \"predicate\": \"is a type of\", \"object\": \"Subtractive Manufacturing\"},\n {\"subject\": \"Milling\", \"predicate\": \"is a type of\", \"object\": \"Subtractive Manufacturing\"},\n {\"subject\": \"Chiseling\", \"predicate\": \"is a type of\", \"object\": \"Subtractive Manufacturing\"},\n {\"subject\": \"Glass Blowing\", \"predicate\": \"is a type of\", \"object\": \"Forming\"},\n {\"subject\": \"Vacuum Forming\", \"predicate\": \"is a type of\", \"object\": \"Forming\"},\n {\"subject\": \"Hydroforming\", \"predicate\": \"is a type of\", \"object\": \"Forming\"},\n {\"subject\": \"PolyJet\", \"predicate\": \"is a technology used in\", \"object\": \"Additive Manufacturing\"},\n {\"subject\": \"FDM\", \"predicate\": \"is a technology used in\", \"object\": \"Additive Manufacturing\"},\n {\"subject\": \"Additive Manufacturing\", \"predicate\": \"has advantage of\", \"object\": \"Freedom\"},\n {\"subject\": \"Additive Manufacturing\", \"predicate\": \"has advantage of\", \"object\": \"Closed System\"},\n {\"subject\": \"Additive Manufacturing\", \"predicate\": \"has advantage of\", \"object\": \"Quic",
"error": null
},
"condition_b": {
"local_latency_s": 69.09,
"local_candidates": [
"ASTM international committee F42",
"Wohler's Report 2014",
"3D Printing",
"Additive Manufacturing",
"Cutting",
"Subtractive Manufacturing",
"Forming",
"Laser cutting",
"Yellow Green Orange fluorescent marker",
"photosteve101",
"Carving",
"Drilling",
"Milling",
"Chiseling",
"Vacuum Forming",
"Dennis van Zuijlekom",
"Hydroforming",
"Aven Orgnac Salle Sup",
"Benh LIEU SONG",
"PolyJet",
"FDM",
"Neolithic Revolution",
"1st Industrial Revolution",
"2nd Industrial Revolution",
"Digital Revolution",
"3rd Industrial Revolution",
"McKinsey & Company"
],
"local_raw": "{\n \"candidates\": [\n \"ASTM international committee F42\",\n \"Wohler's Report 2014\",\n \"3D Printing\",\n \"Additive Manufacturing\",\n \"Cutting\",\n \"Subtractive Manufacturing\",\n \"Forming\",\n \"Laser cutting\",\n \"Yellow Green Orange fluorescent marker\",\n \"photosteve101\",\n \"Carving\",\n \"Drilling\",\n \"Milling\",\n \"Chiseling\",\n \"Vacuum Forming\",\n \"Dennis van Zuijlekom\",\n \"Hydroforming\",\n \"Aven Orgnac Salle Sup\",\n \"Benh LIEU SONG\",\n \"PolyJet\",\n \"FDM\",\n \"Neolithic Revolution\",\n \"1st Industrial Revolution\",\n \"2nd Industrial Revolution\",\n \"Digital Revolution\",\n \"3rd Industrial Revolution\",\n \"McKinsey & Company\"\n ]\n}",
"api_input_tokens": 1798,
"api_output_tokens": 1354,
"api_latency_s": 6.58,
"entity_count": 26,
"edge_count": 29,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Additive Manufacturing\", \"type\": \"Manufacturing Process\"},\n {\"name\": \"3D Printing\", \"type\": \"Manufacturing Process\"},\n {\"name\": \"Cutting\", \"type\": \"Manufacturing Process\"},\n {\"name\": \"Subtractive Manufacturing\", \"type\": \"Manufacturing Process\"},\n {\"name\": \"Forming\", \"type\": \"Manufacturing Process\"},\n {\"name\": \"Laser cutting\", \"type\": \"Manufacturing Technique\"},\n {\"name\": \"Carving\", \"type\": \"Manufacturing Technique\"},\n {\"name\": \"Drilling\", \"type\": \"Manufacturing Technique\"},\n {\"name\": \"Milling\", \"type\": \"Manufacturing Technique\"},\n {\"name\": \"Chiseling\", \"type\": \"Manufacturing Technique\"},\n {\"name\": \"Vacuum Forming\", \"type\": \"Manufacturing Technique\"},\n {\"name\": \"Hydroforming\", \"type\": \"Manufacturing Technique\"},\n {\"name\": \"Glass Blowing\", \"type\": \"Manufacturing Technique\"},\n {\"name\": \"PolyJet\", \"type\": \"3D Printing Technology\"},\n {\"name\": \"FDM\", \"type\": \"3D Printing Technology\"},\n {\"name\": \"ASTM international committee F42\", \"type\": \"Standards Organization\"},\n {\"name\": \"Wohler's Report 2014\", \"type\": \"Publication\"},\n {\"name\": \"Neolithic Revolution\", \"type\": \"Historical Period\"},\n {\"name\": \"1st Industrial Revolution\", \"type\": \"Historical Period\"},\n {\"name\": \"2nd Industrial Revolution\", \"type\": \"Historical Period\"},\n {\"name\": \"Digital Revolution\", \"type\": \"Historical Period\"},\n {\"name\": \"3rd Industrial Revolution\", \"type\": \"Historical Period\"},\n {\"name\": \"McKinsey & Company\", \"type\": \"Organization\"},\n {\"name\": \"Dennis van Zuijlekom\", \"type\": \"Photographer\"},\n {\"name\": \"photosteve101\", \"type\": \"Photographer\"},\n {\"name\": \"Benh LIEU SONG\", \"type\": \"Photographer\"}\n ],\n \"edges\": [\n {\"subject\": \"3D Printing\", \"predicate\": \"is a type of\", \"object\": \"Additive 
Manufacturing\"},\n {\"subject\": \"Additive Manufacturing\", \"predicate\": \"is defined by\", \"object\": \"ASTM international committee F42\"},\n {\"subject\": \"3D Printing\", \"predicate\": \"is defined by\", \"object\": \"ASTM international committee F42\"},\n {\"subject\": \"Additive Manufacturing\", \"predicate\": \"is documented in\", \"object\": \"Wohler's Report 2014\"},\n {\"subject\": \"3D Printing\", \"predicate\": \"is documented in\", \"object\": \"Wohler's Report 2014\"},\n {\"subject\": \"Cutting\", \"predicate\": \"is a type of\", \"object\": \"Manufacturing Process\"},\n {\"subject\": \"Subtractive Manufacturing\", \"predicate\": \"is a type of\", \"object\": \"Manufacturing Process\"},\n {\"subject\": \"Forming\", \"predicate\": \"is a type of\", \"object\": \"Manufacturing Process\"},\n {\"subject\": \"Laser cutting\", \"predicate\": \"is a technique within\", \"object\": \"Cutting\"},\n {\"subject\": \"Carving\", \"predicate\": \"is a technique within\", \"object\": \"Subtractive Manufacturing\"},\n {\"subject\": \"Drilling\", \"predicate\": \"is a technique within\", \"object\": \"Subtractive Manufacturing\"},\n {\"subject\": \"Milling\", \"predicate\": \"is a technique within\", \"object\": \"Subtractive Manufacturing\"},\n {\"subject\": \"Chiseling\", \"predicate\": \"is a technique within\", \"object\": \"Subtractive Manufacturing\"},\n {\"subject\": \"Vacuum Forming\", \"predicate\": \"is a technique within\", \"object\": \"Forming\"},\n {\"subject\": \"Hydroforming\", \"predicate\": \"is a technique within\", \"object\": \"Forming\"},\n {\"subject\": \"Glass Blowing\", \"predicate\": \"is a technique within\", \"object\": \"Forming\"},\n {\"subject\": \"PolyJet\", \"predicate\": \"is a technology used in\", \"object\": \"Additive Manufacturing\"},\n {\"subject\": \"FDM\", \"predicate\": \"is a technology used in\", \"object\": \"Additive Manufacturing\"},\n {\"subject\": \"Additive Manufacturing\", \"predicate\": \"offers advantage 
of\", \"object\": \"Freedom\"},\n {\"subject\": \"Additive Manufacturing\", \"predicate\": \"offers advantage of\", \"object\": \"Quick Production\"},\n {\"subject\": \"Additive Manufacturing\", \"predicate\": \"offers advantage of\", \"object\": \"Multiple Materials, One Print\"},\n {\"subject\": \"Additive Manufacturing\", \"predicate\": \"offers advantage of\", \"object\": \"Closed System\"},\n {\"subject\": \"1st",
"error": null
}
},
{
"source": "01_NURBS Curves.docx",
"size_bucket": "medium",
"doc_chars_original": 1401,
"doc_chars_sent": 1401,
"truncated": false,
"condition_a": {
"input_tokens": 506,
"output_tokens": 1479,
"latency_s": 6.58,
"entity_count": 31,
"edge_count": 36,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Rhino\", \"type\": \"Software\"},\n {\"name\": \"NURBS\", \"type\": \"Mathematical Representation\"},\n {\"name\": \"Geometry\", \"type\": \"Concept\"},\n {\"name\": \"3-D geometry\", \"type\": \"Geometric Form\"},\n {\"name\": \"2-D line\", \"type\": \"Geometric Shape\"},\n {\"name\": \"circle\", \"type\": \"Geometric Shape\"},\n {\"name\": \"arc\", \"type\": \"Geometric Shape\"},\n {\"name\": \"curve\", \"type\": \"Geometric Shape\"},\n {\"name\": \"3-D organic free-form surface\", \"type\": \"Geometric Shape\"},\n {\"name\": \"solid\", \"type\": \"Geometric Shape\"},\n {\"name\": \"illustration\", \"type\": \"Process\"},\n {\"name\": \"animation\", \"type\": \"Process\"},\n {\"name\": \"manufacturing\", \"type\": \"Process\"},\n {\"name\": \"point\", \"type\": \"Geometric Element\"},\n {\"name\": \"Cartesian space\", \"type\": \"Coordinate System\"},\n {\"name\": \"X axis\", \"type\": \"Coordinate Axis\"},\n {\"name\": \"Y axis\", \"type\": \"Coordinate Axis\"},\n {\"name\": \"Z axis\", \"type\": \"Coordinate Axis\"},\n {\"name\": \"2D geometry\", \"type\": \"Geometric Form\"},\n {\"name\": \"3D geometry\", \"type\": \"Geometric Form\"},\n {\"name\": \"line\", \"type\": \"Geometric Shape\"},\n {\"name\": \"polygon\", \"type\": \"Geometric Shape\"},\n {\"name\": \"ellipse\", \"type\": \"Geometric Shape\"},\n {\"name\": \"helix\", \"type\": \"Geometric Shape\"},\n {\"name\": \"spiral\", \"type\": \"Geometric Shape\"},\n {\"name\": \"open curve\", \"type\": \"Curve Type\"},\n {\"name\": \"closed curve\", \"type\": \"Curve Type\"},\n {\"name\": \"planar curve\", \"type\": \"Curve Type\"},\n {\"name\": \"non-planar curve\", \"type\": \"Curve Type\"},\n {\"name\": \"single curve\", \"type\": \"Curve Category\"},\n {\"name\": \"polycurve\", \"type\": \"Curve Category\"}\n ],\n \"edges\": [\n {\"subject\": \"Rhino\", \"predicate\": \"represents all geometry with\", \"object\": \"NURBS\"},\n {\"subject\": 
\"NURBS\", \"predicate\": \"are mathematical representations of\", \"object\": \"3-D geometry\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", \"object\": \"2-D line\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", \"object\": \"circle\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", \"object\": \"arc\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", \"object\": \"curve\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", \"object\": \"3-D organic free-form surface\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", \"object\": \"solid\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"illustration\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"animation\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"manufacturing\"},\n {\"subject\": \"point\", \"predicate\": \"is the most basic element in\", \"object\": \"Rhino\"},\n {\"subject\": \"point\", \"predicate\": \"represents a position in\", \"object\": \"Cartesian space\"},\n {\"subject\": \"position\", \"predicate\": \"is expressed along\", \"object\": \"X axis\"},\n {\"subject\": \"position\", \"predicate\": \"is expressed along\", \"object\": \"Y axis\"},\n {\"subject\": \"position\", \"predicate\": \"is expressed along\", \"object\": \"Z axis\"},\n {\"subject\": \"points\", \"predicate\": \"are used to define\", \"object\": \"2D geometry\"},\n {\"subject\": \"points\", \"predicate\": \"are used to define\", \"object\": \"3D geometry\"},\n {\"subject\": \"curve\", \"predicate\": \"is a 2D geometric form in Rhino\", \"object\": \"Geometry\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"line\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"circle\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", 
\"object\": \"arc\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"polygon\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"ellipse\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"helix\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"spiral\"},\n {\"subject\": \"curve\", \"predicate\": \"can be\", \"object\": \"open curve\"},\n {\"subject\": \"curve\", \"predicate\": \"can be\", \"object\": ",
"error": null
},
"condition_b": {
"local_latency_s": 14.08,
"local_candidates": [
"Rhino",
"Non-Uniform Rational B-Splines (NURBS)",
"X",
"Y",
"Z",
"Cartesian space",
"Point",
"Line",
"Curve",
"Arc",
"Circle",
"Polygon",
"Ellipse",
"Helix",
"Spiral"
],
"local_raw": "{\n \"candidates\": [\"Rhino\", \"Non-Uniform Rational B-Splines (NURBS)\", \"X\", \"Y\", \"Z\", \"Cartesian space\", \"Point\", \"Line\", \"Curve\", \"Arc\", \"Circle\", \"Polygon\", \"Ellipse\", \"Helix\", \"Spiral\"]\n}",
"api_input_tokens": 616,
"api_output_tokens": 991,
"api_latency_s": 5.02,
"entity_count": 18,
"edge_count": 23,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Rhino\", \"type\": \"Software\"},\n {\"name\": \"Non-Uniform Rational B-Splines (NURBS)\", \"type\": \"Mathematical Representation\"},\n {\"name\": \"X\", \"type\": \"Axis\"},\n {\"name\": \"Y\", \"type\": \"Axis\"},\n {\"name\": \"Z\", \"type\": \"Axis\"},\n {\"name\": \"Cartesian space\", \"type\": \"Coordinate System\"},\n {\"name\": \"Point\", \"type\": \"Geometric Element\"},\n {\"name\": \"Line\", \"type\": \"Curve Type\"},\n {\"name\": \"Curve\", \"type\": \"Geometric Element\"},\n {\"name\": \"Arc\", \"type\": \"Curve Type\"},\n {\"name\": \"Circle\", \"type\": \"Curve Type\"},\n {\"name\": \"Polygon\", \"type\": \"Curve Type\"},\n {\"name\": \"Ellipse\", \"type\": \"Curve Type\"},\n {\"name\": \"Helix\", \"type\": \"Curve Type\"},\n {\"name\": \"Spiral\", \"type\": \"Curve Type\"},\n {\"name\": \"Surface\", \"type\": \"Geometric Element\"},\n {\"name\": \"Solid\", \"type\": \"Geometric Element\"},\n {\"name\": \"Polycurve\", \"type\": \"Geometric Element\"}\n ],\n \"edges\": [\n {\"subject\": \"Rhino\", \"predicate\": \"represents all geometry using\", \"object\": \"Non-Uniform Rational B-Splines (NURBS)\"},\n {\"subject\": \"Non-Uniform Rational B-Splines (NURBS)\", \"predicate\": \"can describe\", \"object\": \"Line\"},\n {\"subject\": \"Non-Uniform Rational B-Splines (NURBS)\", \"predicate\": \"can describe\", \"object\": \"Circle\"},\n {\"subject\": \"Non-Uniform Rational B-Splines (NURBS)\", \"predicate\": \"can describe\", \"object\": \"Arc\"},\n {\"subject\": \"Non-Uniform Rational B-Splines (NURBS)\", \"predicate\": \"can describe\", \"object\": \"Curve\"},\n {\"subject\": \"Non-Uniform Rational B-Splines (NURBS)\", \"predicate\": \"can describe\", \"object\": \"Surface\"},\n {\"subject\": \"Non-Uniform Rational B-Splines (NURBS)\", \"predicate\": \"can describe\", \"object\": \"Solid\"},\n {\"subject\": \"Point\", \"predicate\": \"represents position in\", \"object\": \"Cartesian 
space\"},\n {\"subject\": \"Point\", \"predicate\": \"is expressed along\", \"object\": \"X\"},\n {\"subject\": \"Point\", \"predicate\": \"is expressed along\", \"object\": \"Y\"},\n {\"subject\": \"Point\", \"predicate\": \"is expressed along\", \"object\": \"Z\"},\n {\"subject\": \"Point\", \"predicate\": \"is the most basic element in\", \"object\": \"Rhino\"},\n {\"subject\": \"Curve\", \"predicate\": \"includes\", \"object\": \"Line\"},\n {\"subject\": \"Curve\", \"predicate\": \"includes\", \"object\": \"Arc\"},\n {\"subject\": \"Curve\", \"predicate\": \"includes\", \"object\": \"Circle\"},\n {\"subject\": \"Curve\", \"predicate\": \"includes\", \"object\": \"Polygon\"},\n {\"subject\": \"Curve\", \"predicate\": \"includes\", \"object\": \"Ellipse\"},\n {\"subject\": \"Curve\", \"predicate\": \"includes\", \"object\": \"Helix\"},\n {\"subject\": \"Curve\", \"predicate\": \"includes\", \"object\": \"Spiral\"},\n {\"subject\": \"Curve\", \"predicate\": \"are building blocks for\", \"object\": \"Surface\"},\n {\"subject\": \"Curve\", \"predicate\": \"are building blocks for\", \"object\": \"Solid\"},\n {\"subject\": \"Polycurve\", \"predicate\": \"is composed of multiple\", \"object\": \"Curve\"},\n {\"subject\": \"Curve\", \"predicate\": \"can be joined to\", \"object\": \"Curve\"}\n ]\n}\n```",
"error": null
}
},
{
"source": "02_Point of Curves - AARON.docx",
"size_bucket": "medium",
"doc_chars_original": 2116,
"doc_chars_sent": 2116,
"truncated": false,
"condition_a": {
"input_tokens": 667,
"output_tokens": 2012,
"latency_s": 9.5,
"entity_count": 44,
"edge_count": 48,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Rhino\", \"type\": \"Software\"},\n {\"name\": \"NURBS\", \"type\": \"Mathematical Representation\"},\n {\"name\": \"Geometry\", \"type\": \"Concept\"},\n {\"name\": \"3-D geometry\", \"type\": \"Geometric Type\"},\n {\"name\": \"2-D line\", \"type\": \"Geometric Shape\"},\n {\"name\": \"circle\", \"type\": \"Geometric Shape\"},\n {\"name\": \"arc\", \"type\": \"Geometric Shape\"},\n {\"name\": \"curve\", \"type\": \"Geometric Shape\"},\n {\"name\": \"3-D organic free-form surface\", \"type\": \"Geometric Shape\"},\n {\"name\": \"solid\", \"type\": \"Geometric Shape\"},\n {\"name\": \"point\", \"type\": \"Geometric Element\"},\n {\"name\": \"3D Cartesian space\", \"type\": \"Coordinate System\"},\n {\"name\": \"X-axis\", \"type\": \"Coordinate Axis\"},\n {\"name\": \"Y-axis\", \"type\": \"Coordinate Axis\"},\n {\"name\": \"Z-axis\", \"type\": \"Coordinate Axis\"},\n {\"name\": \"2D geometry\", \"type\": \"Geometric Type\"},\n {\"name\": \"3D geometry\", \"type\": \"Geometric Type\"},\n {\"name\": \"line\", \"type\": \"Geometric Shape\"},\n {\"name\": \"polygon\", \"type\": \"Geometric Shape\"},\n {\"name\": \"ellipse\", \"type\": \"Geometric Shape\"},\n {\"name\": \"helix\", \"type\": \"Geometric Shape\"},\n {\"name\": \"spiral\", \"type\": \"Geometric Shape\"},\n {\"name\": \"open curve\", \"type\": \"Curve Type\"},\n {\"name\": \"closed curve\", \"type\": \"Curve Type\"},\n {\"name\": \"planar curve\", \"type\": \"Curve Type\"},\n {\"name\": \"non-planar curve\", \"type\": \"Curve Type\"},\n {\"name\": \"single curve\", \"type\": \"Curve Type\"},\n {\"name\": \"polycurve\", \"type\": \"Curve Type\"},\n {\"name\": \"polyline\", \"type\": \"Geometric Shape\"},\n {\"name\": \"illustration\", \"type\": \"Application\"},\n {\"name\": \"animation\", \"type\": \"Application\"},\n {\"name\": \"manufacturing\", \"type\": \"Application\"},\n {\"name\": \"small object template\", \"type\": \"Template\"},\n 
{\"name\": \"Top viewport\", \"type\": \"Viewport\"},\n {\"name\": \"Circle layer\", \"type\": \"Layer\"},\n {\"name\": \"Star layer\", \"type\": \"Layer\"},\n {\"name\": \"Rectangle layer\", \"type\": \"Layer\"},\n {\"name\": \"Polyline layer\", \"type\": \"Layer\"},\n {\"name\": \"Curve layer\", \"type\": \"Layer\"},\n {\"name\": \"Shape layer\", \"type\": \"Layer\"},\n {\"name\": \"Star\", \"type\": \"Geometric Shape\"},\n {\"name\": \"Rectangle\", \"type\": \"Geometric Shape\"},\n {\"name\": \"Rectangle with rounded corners\", \"type\": \"Geometric Shape\"},\n {\"name\": \"Open Poly-line\", \"type\": \"Geometric Shape\"}\n ],\n \"edges\": [\n {\"subject\": \"Rhino\", \"predicate\": \"represents all geometry with\", \"object\": \"NURBS\"},\n {\"subject\": \"NURBS\", \"predicate\": \"is a\", \"object\": \"Mathematical Representation\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"2-D line\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"circle\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"arc\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"curve\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"3-D organic free-form surface\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"solid\"},\n {\"subject\": \"NURBS\", \"predicate\": \"have\", \"object\": \"flexibility\"},\n {\"subject\": \"NURBS\", \"predicate\": \"have\", \"object\": \"accuracy\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"illustration\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"animation\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"manufacturing\"},\n {\"subject\": \"point\", \"predicate\": \"is the most basic element in\", \"object\": \"Rhino\"},\n {\"subject\": \"point\", \"predicate\": \"represents a 
position in\", \"object\": \"3D Cartesian space\"},\n {\"subject\": \"point\", \"predicate\": \"is expressed as coordinates along\", \"object\": \"X-axis\"},\n {\"subject\": \"point\", \"predicate\": \"is expressed as coordinates along\", \"object\": \"Y-axis\"},\n {\"subject\": \"point\", \"predicate\": \"is expressed as coordinates along\", \"object\": \"Z-axis\"},\n {\"subject\": \"points\", \"predicate\": \"are building blocks that define\", \"object\": \"2D geometry\"},\n {\"s",
"error": null
},
"condition_b": {
"local_latency_s": 18.77,
"local_candidates": [
"Rhino",
"NURBS",
"Point",
"X-axis",
"Y-axis",
"Z-axis",
"Curve",
"Line",
"Arc",
"Circle",
"Polygon",
"Ellipse",
"Helix",
"Spiral",
"Polycurve",
"Homework",
"small object template",
"Top viewport",
"Layer name",
"Circle",
"Star",
"Rectangle",
"Rectangle with rounded corners",
"Polyline",
"Open Poly-line",
"Closed Curve",
"Shape"
],
"local_raw": "{\n \"candidates\": [\"Rhino\", \"NURBS\", \"Point\", \"X-axis\", \"Y-axis\", \"Z-axis\", \"Curve\", \"Line\", \"Arc\", \"Circle\", \"Polygon\", \"Ellipse\", \"Helix\", \"Spiral\", \"Polycurve\", \"Homework\", \"small object template\", \"Top viewport\", \"Layer name\", \"Circle\", \"Star\", \"Rectangle\", \"Rectangle with rounded corners\", \"Polyline\", \"Open Poly-line\", \"Closed Curve\", \"Shape\"]\n}",
"api_input_tokens": 824,
"api_output_tokens": 1447,
"api_latency_s": 6.54,
"entity_count": 29,
"edge_count": 35,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Rhino\", \"type\": \"Software\"},\n {\"name\": \"NURBS\", \"type\": \"Mathematical Representation\"},\n {\"name\": \"Point\", \"type\": \"Geometric Element\"},\n {\"name\": \"X-axis\", \"type\": \"Coordinate Axis\"},\n {\"name\": \"Y-axis\", \"type\": \"Coordinate Axis\"},\n {\"name\": \"Z-axis\", \"type\": \"Coordinate Axis\"},\n {\"name\": \"3D Cartesian space\", \"type\": \"Coordinate System\"},\n {\"name\": \"Curve\", \"type\": \"Geometric Entity\"},\n {\"name\": \"Line\", \"type\": \"Curve Type\"},\n {\"name\": \"Arc\", \"type\": \"Curve Type\"},\n {\"name\": \"Circle\", \"type\": \"Curve Type\"},\n {\"name\": \"Polygon\", \"type\": \"Curve Type\"},\n {\"name\": \"Ellipse\", \"type\": \"Curve Type\"},\n {\"name\": \"Helix\", \"type\": \"Curve Type\"},\n {\"name\": \"Spiral\", \"type\": \"Curve Type\"},\n {\"name\": \"Polycurve\", \"type\": \"Geometric Entity\"},\n {\"name\": \"2D geometry\", \"type\": \"Geometry Type\"},\n {\"name\": \"3D geometry\", \"type\": \"Geometry Type\"},\n {\"name\": \"Polyline\", \"type\": \"Geometric Entity\"},\n {\"name\": \"Open Poly-line\", \"type\": \"Polyline Type\"},\n {\"name\": \"Closed Curve\", \"type\": \"Curve Type\"},\n {\"name\": \"Homework 1\", \"type\": \"Assignment\"},\n {\"name\": \"small object template\", \"type\": \"Document Template\"},\n {\"name\": \"Top viewport\", \"type\": \"Viewport\"},\n {\"name\": \"Layer\", \"type\": \"Document Organization\"},\n {\"name\": \"Rectangle\", \"type\": \"Curve Type\"},\n {\"name\": \"Rectangle with rounded corners\", \"type\": \"Curve Type\"},\n {\"name\": \"Star\", \"type\": \"Shape\"},\n {\"name\": \"Shape\", \"type\": \"Geometric Entity\"}\n ],\n \"edges\": [\n {\"subject\": \"Rhino\", \"predicate\": \"represents all geometry with\", \"object\": \"NURBS\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", \"object\": \"2D geometry\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can describe\", 
\"object\": \"3D geometry\"},\n {\"subject\": \"Point\", \"predicate\": \"represents position in\", \"object\": \"3D Cartesian space\"},\n {\"subject\": \"Point\", \"predicate\": \"is expressed as coordinates along\", \"object\": \"X-axis\"},\n {\"subject\": \"Point\", \"predicate\": \"is expressed as coordinates along\", \"object\": \"Y-axis\"},\n {\"subject\": \"Point\", \"predicate\": \"is expressed as coordinates along\", \"object\": \"Z-axis\"},\n {\"subject\": \"Point\", \"predicate\": \"is basic building block for\", \"object\": \"2D geometry\"},\n {\"subject\": \"Point\", \"predicate\": \"is basic building block for\", \"object\": \"3D geometry\"},\n {\"subject\": \"Curve\", \"predicate\": \"is basic geometric entity in\", \"object\": \"Rhino\"},\n {\"subject\": \"Curve\", \"predicate\": \"is generalization of\", \"object\": \"Line\"},\n {\"subject\": \"Curve\", \"predicate\": \"includes\", \"object\": \"Line\"},\n {\"subject\": \"Curve\", \"predicate\": \"includes\", \"object\": \"Arc\"},\n {\"subject\": \"Curve\", \"predicate\": \"includes\", \"object\": \"Circle\"},\n {\"subject\": \"Curve\", \"predicate\": \"includes\", \"object\": \"Polygon\"},\n {\"subject\": \"Curve\", \"predicate\": \"includes\", \"object\": \"Ellipse\"},\n {\"subject\": \"Curve\", \"predicate\": \"includes\", \"object\": \"Helix\"},\n {\"subject\": \"Curve\", \"predicate\": \"includes\", \"object\": \"Spiral\"},\n {\"subject\": \"Curve\", \"predicate\": \"is building block for\", \"object\": \"complex geometries\"},\n {\"subject\": \"Polycurve\", \"predicate\": \"is composed of\", \"object\": \"Curve\"},\n {\"subject\": \"Polycurve\", \"predicate\": \"joins curves at\", \"object\": \"end points\"},\n {\"subject\": \"2D geometry\", \"predicate\": \"is comprised of\", \"object\": \"Curve\"},\n {\"subject\": \"2D geometry\", \"predicate\": \"is comprised of\", \"object\": \"Line\"},\n {\"subject\": \"2D geometry\", \"predicate\": \"is comprised of\", \"object\": \"Polyline\"},\n 
{\"subject\": \"Homework 1\", \"predicate\": \"requires creating document with\", \"object\": \"small object template\"},\n {\"subject\": \"Homework 1\", \"predicate\": \"requires working in\", \"object\": \"Top viewport\"},\n {\"subject\": \"Homework 1\", \"predicate\": \"requires creating\", \"object\": \"Circle\"},\n {\"subject\": \"Homework 1\", \"predicate\": \"requires creating\", \"object\": \"Star\"},\n {\"subject\": \"Homework 1\", \"predicate\": \"requires creating\", \"object\": \"Re",
"error": null
}
},
{
"source": "02_Point of Curves.docx",
"size_bucket": "medium",
"doc_chars_original": 1338,
"doc_chars_sent": 1338,
"truncated": false,
"condition_a": {
"input_tokens": 491,
"output_tokens": 1392,
"latency_s": 6.59,
"entity_count": 30,
"edge_count": 34,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Rhino\", \"type\": \"Software\"},\n {\"name\": \"NURBS\", \"type\": \"Mathematical Representation\"},\n {\"name\": \"Geometry\", \"type\": \"Concept\"},\n {\"name\": \"3-D geometry\", \"type\": \"Geometric Form\"},\n {\"name\": \"2-D line\", \"type\": \"Geometric Shape\"},\n {\"name\": \"circle\", \"type\": \"Geometric Shape\"},\n {\"name\": \"arc\", \"type\": \"Geometric Shape\"},\n {\"name\": \"curve\", \"type\": \"Geometric Shape\"},\n {\"name\": \"3-D organic free-form surface\", \"type\": \"Geometric Shape\"},\n {\"name\": \"solid\", \"type\": \"Geometric Shape\"},\n {\"name\": \"illustration\", \"type\": \"Application\"},\n {\"name\": \"animation\", \"type\": \"Application\"},\n {\"name\": \"manufacturing\", \"type\": \"Application\"},\n {\"name\": \"point\", \"type\": \"Geometric Element\"},\n {\"name\": \"area\", \"type\": \"Property\"},\n {\"name\": \"volume\", \"type\": \"Property\"},\n {\"name\": \"XYZ intersection\", \"type\": \"Coordinate System\"},\n {\"name\": \"2D geometry\", \"type\": \"Geometric Form\"},\n {\"name\": \"3D geometry\", \"type\": \"Geometric Form\"},\n {\"name\": \"line\", \"type\": \"Geometric Shape\"},\n {\"name\": \"polygon\", \"type\": \"Geometric Shape\"},\n {\"name\": \"ellipse\", \"type\": \"Geometric Shape\"},\n {\"name\": \"helix\", \"type\": \"Geometric Shape\"},\n {\"name\": \"spiral\", \"type\": \"Geometric Shape\"},\n {\"name\": \"open curve\", \"type\": \"Curve Type\"},\n {\"name\": \"closed curve\", \"type\": \"Curve Type\"},\n {\"name\": \"planar curve\", \"type\": \"Curve Type\"},\n {\"name\": \"non-planar curve\", \"type\": \"Curve Type\"},\n {\"name\": \"single curve\", \"type\": \"Curve Category\"},\n {\"name\": \"polycurve\", \"type\": \"Curve Category\"}\n ],\n \"edges\": [\n {\"subject\": \"Rhino\", \"predicate\": \"represents all geometry with\", \"object\": \"NURBS\"},\n {\"subject\": \"NURBS\", \"predicate\": \"are mathematical representations 
of\", \"object\": \"3-D geometry\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", \"object\": \"2-D line\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", \"object\": \"circle\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", \"object\": \"arc\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", \"object\": \"curve\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", \"object\": \"3-D organic free-form surface\"},\n {\"subject\": \"NURBS\", \"predicate\": \"can accurately describe\", \"object\": \"solid\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"illustration\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"animation\"},\n {\"subject\": \"NURBS models\", \"predicate\": \"can be used in\", \"object\": \"manufacturing\"},\n {\"subject\": \"point\", \"predicate\": \"is the most basic element in\", \"object\": \"Rhino\"},\n {\"subject\": \"point\", \"predicate\": \"represents\", \"object\": \"XYZ intersection\"},\n {\"subject\": \"point\", \"predicate\": \"has no\", \"object\": \"area\"},\n {\"subject\": \"point\", \"predicate\": \"has no\", \"object\": \"volume\"},\n {\"subject\": \"point\", \"predicate\": \"is used to define\", \"object\": \"2D geometry\"},\n {\"subject\": \"point\", \"predicate\": \"is used to define\", \"object\": \"3D geometry\"},\n {\"subject\": \"curve\", \"predicate\": \"is a 2D geometric form in Rhino\", \"object\": \"Geometry\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"line\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"circle\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"arc\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"polygon\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"ellipse\"},\n {\"subject\": 
\"curve\", \"predicate\": \"includes\", \"object\": \"helix\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"spiral\"},\n {\"subject\": \"curve\", \"predicate\": \"can be\", \"object\": \"open curve\"},\n {\"subject\": \"curve\", \"predicate\": \"can be\", \"object\": \"closed curve\"},\n {\"subject\": \"curve\", \"predicate\": \"can be\", \"object\": \"planar curve\"},\n {\"subject\": \"curve\", \"predicate\": \"can be\", \"object\": \"non-planar curve\"},\n {\"subject\": ",
"error": null
},
"condition_b": {
"local_latency_s": 10.39,
"local_candidates": [
"Rhino",
"Non-Uniform Rational B-Splines (NURBS)",
"point",
"X",
"Y",
"Z",
"line",
"curve",
"arc",
"circle",
"polygon",
"ellipse",
"helices",
"spirals",
"polycurve"
],
"local_raw": "{\n \"candidates\": [\"Rhino\", \"Non-Uniform Rational B-Splines (NURBS)\", \"point\", \"X\", \"Y\", \"Z\", \"line\", \"curve\", \"arc\", \"circle\", \"polygon\", \"ellipse\", \"helices\", \"spirals\", \"polycurve\"]\n}",
"api_input_tokens": 597,
"api_output_tokens": 1015,
"api_latency_s": 5.03,
"entity_count": 19,
"edge_count": 24,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Rhino\", \"type\": \"Software\"},\n {\"name\": \"Non-Uniform Rational B-Splines (NURBS)\", \"type\": \"Mathematical Representation\"},\n {\"name\": \"point\", \"type\": \"Geometric Element\"},\n {\"name\": \"X\", \"type\": \"Coordinate Axis\"},\n {\"name\": \"Y\", \"type\": \"Coordinate Axis\"},\n {\"name\": \"Z\", \"type\": \"Coordinate Axis\"},\n {\"name\": \"line\", \"type\": \"Curve Type\"},\n {\"name\": \"curve\", \"type\": \"Geometric Element\"},\n {\"name\": \"arc\", \"type\": \"Curve Type\"},\n {\"name\": \"circle\", \"type\": \"Curve Type\"},\n {\"name\": \"polygon\", \"type\": \"Curve Type\"},\n {\"name\": \"ellipse\", \"type\": \"Curve Type\"},\n {\"name\": \"helix\", \"type\": \"Curve Type\"},\n {\"name\": \"spiral\", \"type\": \"Curve Type\"},\n {\"name\": \"polycurve\", \"type\": \"Geometric Element\"},\n {\"name\": \"2D geometry\", \"type\": \"Geometry Type\"},\n {\"name\": \"3D geometry\", \"type\": \"Geometry Type\"},\n {\"name\": \"surface\", \"type\": \"Geometric Element\"},\n {\"name\": \"solid\", \"type\": \"Geometric Element\"}\n ],\n \"edges\": [\n {\"subject\": \"Rhino\", \"predicate\": \"represents all geometry using\", \"object\": \"Non-Uniform Rational B-Splines (NURBS)\"},\n {\"subject\": \"Non-Uniform Rational B-Splines (NURBS)\", \"predicate\": \"can describe\", \"object\": \"line\"},\n {\"subject\": \"Non-Uniform Rational B-Splines (NURBS)\", \"predicate\": \"can describe\", \"object\": \"circle\"},\n {\"subject\": \"Non-Uniform Rational B-Splines (NURBS)\", \"predicate\": \"can describe\", \"object\": \"arc\"},\n {\"subject\": \"Non-Uniform Rational B-Splines (NURBS)\", \"predicate\": \"can describe\", \"object\": \"curve\"},\n {\"subject\": \"Non-Uniform Rational B-Splines (NURBS)\", \"predicate\": \"can describe\", \"object\": \"surface\"},\n {\"subject\": \"Non-Uniform Rational B-Splines (NURBS)\", \"predicate\": \"can describe\", \"object\": \"solid\"},\n {\"subject\": 
\"point\", \"predicate\": \"represents\", \"object\": \"X\"},\n {\"subject\": \"point\", \"predicate\": \"represents\", \"object\": \"Y\"},\n {\"subject\": \"point\", \"predicate\": \"represents\", \"object\": \"Z\"},\n {\"subject\": \"point\", \"predicate\": \"is the most basic element in\", \"object\": \"Rhino\"},\n {\"subject\": \"point\", \"predicate\": \"used to define\", \"object\": \"2D geometry\"},\n {\"subject\": \"point\", \"predicate\": \"used to define\", \"object\": \"3D geometry\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"line\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"arc\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"circle\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"polygon\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"ellipse\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"helix\"},\n {\"subject\": \"curve\", \"predicate\": \"includes\", \"object\": \"spiral\"},\n {\"subject\": \"curve\", \"predicate\": \"is a building block for\", \"object\": \"3D geometry\"},\n {\"subject\": \"polycurve\", \"predicate\": \"is composed of multiple\", \"object\": \"curve\"},\n {\"subject\": \"polycurve\", \"predicate\": \"is a type of\", \"object\": \"2D geometry\"},\n {\"subject\": \"curve\", \"predicate\": \"is a type of\", \"object\": \"2D geometry\"}\n ]\n}\n```",
"error": null
}
},
{
"source": "02_PPT_ALL_AM_Technologies_for_3DP_v3.pptx",
"size_bucket": "large",
"doc_chars_original": 9360,
"doc_chars_sent": 8000,
"truncated": true,
"condition_a": {
"input_tokens": 2398,
"output_tokens": 4096,
"latency_s": 18.21,
"entity_count": null,
"edge_count": null,
"stop_reason": "max_tokens",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Additive Manufacturing\", \"type\": \"Technology Domain\"},\n {\"name\": \"ASTM\", \"type\": \"Organization\"},\n {\"name\": \"Material Extrusion\", \"type\": \"AM Process\"},\n {\"name\": \"FDM\", \"type\": \"AM Technology\"},\n {\"name\": \"Fused Deposition Modeling\", \"type\": \"AM Technology\"},\n {\"name\": \"FFF\", \"type\": \"AM Technology\"},\n {\"name\": \"Fused Filament Fabrication\", \"type\": \"AM Technology\"},\n {\"name\": \"Vat Photopolymerization\", \"type\": \"AM Process\"},\n {\"name\": \"SL\", \"type\": \"AM Technology\"},\n {\"name\": \"SLA\", \"type\": \"AM Technology\"},\n {\"name\": \"Stereolithography\", \"type\": \"AM Technology\"},\n {\"name\": \"DLP\", \"type\": \"AM Technology\"},\n {\"name\": \"Digital Light Processing\", \"type\": \"AM Technology\"},\n {\"name\": \"3SP\", \"type\": \"AM Technology\"},\n {\"name\": \"Powder Bed Fusion\", \"type\": \"AM Process\"},\n {\"name\": \"SLS\", \"type\": \"AM Technology\"},\n {\"name\": \"Selective Laser Sintering\", \"type\": \"AM Technology\"},\n {\"name\": \"DMLS\", \"type\": \"AM Technology\"},\n {\"name\": \"Direct Metal Laser Sintering\", \"type\": \"AM Technology\"},\n {\"name\": \"EBM\", \"type\": \"AM Technology\"},\n {\"name\": \"Electron Beam Melting\", \"type\": \"AM Technology\"},\n {\"name\": \"SHS\", \"type\": \"AM Technology\"},\n {\"name\": \"Selective Heat Sintering\", \"type\": \"AM Technology\"},\n {\"name\": \"Binder Jetting\", \"type\": \"AM Process\"},\n {\"name\": \"CJP\", \"type\": \"AM Technology\"},\n {\"name\": \"ColorJet Printing\", \"type\": \"AM Technology\"},\n {\"name\": \"PP\", \"type\": \"AM Technology\"},\n {\"name\": \"Plaster-based 3D Printing\", \"type\": \"AM Technology\"},\n {\"name\": \"Sheet Lamination\", \"type\": \"AM Process\"},\n {\"name\": \"UC\", \"type\": \"AM Technology\"},\n {\"name\": \"Ultrasonic Consolidation\", \"type\": \"AM Technology\"},\n {\"name\": \"LOM\", \"type\": \"AM 
Technology\"},\n {\"name\": \"Laminated Object Manufacturing\", \"type\": \"AM Technology\"},\n {\"name\": \"Directed Energy Deposition\", \"type\": \"AM Process\"},\n {\"name\": \"LMD\", \"type\": \"AM Technology\"},\n {\"name\": \"Laser Metal Deposition\", \"type\": \"AM Technology\"},\n {\"name\": \"Material Jetting\", \"type\": \"AM Process\"},\n {\"name\": \"MJP\", \"type\": \"AM Technology\"},\n {\"name\": \"MultiJet Printing\", \"type\": \"AM Technology\"},\n {\"name\": \"PJ\", \"type\": \"AM Technology\"},\n {\"name\": \"PolyJet\", \"type\": \"AM Technology\"},\n {\"name\": \"Photopolymer Jetting\", \"type\": \"AM Technology\"},\n {\"name\": \"LM\", \"type\": \"AM Technology\"},\n {\"name\": \"Laser Melting\", \"type\": \"AM Technology\"},\n {\"name\": \"SLM\", \"type\": \"AM Technology\"},\n {\"name\": \"Selective Laser Melting\", \"type\": \"AM Technology\"},\n {\"name\": \"BJ\", \"type\": \"AM Process\"},\n {\"name\": \"MJ\", \"type\": \"AM Process\"},\n {\"name\": \"3D Printing\", \"type\": \"Manufacturing Process\"},\n {\"name\": \"Prototypes\", \"type\": \"Application\"},\n {\"name\": \"Manufacturing Aides\", \"type\": \"Application\"},\n {\"name\": \"Small series parts\", \"type\": \"Application\"},\n {\"name\": \"Casting patterns\", \"type\": \"Application\"},\n {\"name\": \"Support parts\", \"type\": \"Application\"},\n {\"name\": \"Tools for injection molds\", \"type\": \"Application\"},\n {\"name\": \"Green parts\", \"type\": \"Application\"},\n {\"name\": \"Molds and cores\", \"type\": \"Application\"},\n {\"name\": \"Lost wax casting\", \"type\": \"Application\"},\n {\"name\": \"Jewelry\", \"type\": \"Application\"},\n {\"name\": \"Dental\", \"type\": \"Application\"},\n {\"name\": \"Photopolymers\", \"type\": \"Material\"},\n {\"name\": \"Plastics\", \"type\": \"Material\"},\n {\"name\": \"Metals\", \"type\": \"Material\"},\n {\"name\": \"Thermoplastics\", \"type\": \"Material\"},\n {\"name\": \"Powder\", \"type\": \"Material Form\"},\n 
{\"name\": \"UV-active photopolymers\", \"type\": \"Material\"},\n {\"name\": \"Nylon\", \"type\": \"Material\"},\n {\"name\": \"Wax-like materials\", \"type\": \"Material\"},\n {\"name\": \"Anisotropy\", \"type\": \"Property\"},\n {\"name\": \"Mechanical properties\", \"type\": \"Property\"},\n {\"name\": \"Surface finishes\", \"type\": \"Property\"},\n {\"name\": \"Accuracy\", \"type\": \"Property\"},\n {\"name\": \"Density\", \"type\": \"Property\"},\n {\"name",
"error": null
},
"condition_b": {
"local_latency_s": 100.44,
"local_candidates": [
"Additive Manufacturing",
"AM",
"ASTM",
"American Society for Testing Materials",
"Material Extrusion",
"FDM",
"Fused Deposition Modeling",
"FFF",
"Fused Filament Fabrication",
"Vat Photopolymerization",
"SL",
"Stereolithography",
"DLP",
"Digital Light Processing",
"Powder Bed Fusion",
"SLS",
"Direct Metal Laser Sintering",
"EBM",
"Electron Beam Melting",
"SHS",
"Selective Heat Sintering",
"Binder Jetting",
"CJP",
"ColorJet Printing",
"PP",
"Plaster-based 3D Printing",
"Sheet Lamination",
"UC",
"Ultrasonic Consolidation",
"LOM",
"Laminated Object Manufacturing"
],
"local_raw": "{\n \"candidates\": [\n \"Additive Manufacturing\",\n \"AM\",\n \"ASTM\",\n \"American Society for Testing Materials\",\n \"Material Extrusion\",\n \"FDM\",\n \"Fused Deposition Modeling\",\n \"FFF\",\n \"Fused Filament Fabrication\",\n \"Vat Photopolymerization\",\n \"SL\",\n \"Stereolithography\",\n \"DLP\",\n \"Digital Light Processing\",\n \"Powder Bed Fusion\",\n \"SLS\",\n \"Direct Metal Laser Sintering\",\n \"EBM\",\n \"Electron Beam Melting\",\n \"SHS\",\n \"Selective Heat Sintering\",\n \"Binder Jetting\",\n \"CJP\",\n \"ColorJet Printing\",\n \"PP\",\n \"Plaster-based 3D Printing\",\n \"Sheet Lamination\",\n \"UC\",\n \"Ultrasonic Consolidation\",\n \"LOM\",\n \"Laminated Object Manufacturing\"\n ]\n}",
"api_input_tokens": 2635,
"api_output_tokens": 4096,
"api_latency_s": 18.61,
"entity_count": null,
"edge_count": null,
"stop_reason": "max_tokens",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Additive Manufacturing\", \"type\": \"Technology Category\"},\n {\"name\": \"AM\", \"type\": \"Abbreviation\"},\n {\"name\": \"ASTM\", \"type\": \"Organization\"},\n {\"name\": \"American Society for Testing Materials\", \"type\": \"Organization\"},\n {\"name\": \"Material Extrusion\", \"type\": \"AM Process\"},\n {\"name\": \"FDM\", \"type\": \"Technology\"},\n {\"name\": \"Fused Deposition Modeling\", \"type\": \"Technology\"},\n {\"name\": \"FFF\", \"type\": \"Technology\"},\n {\"name\": \"Fused Filament Fabrication\", \"type\": \"Technology\"},\n {\"name\": \"Vat Photopolymerization\", \"type\": \"AM Process\"},\n {\"name\": \"SL\", \"type\": \"Technology\"},\n {\"name\": \"SLA\", \"type\": \"Technology\"},\n {\"name\": \"Stereolithography\", \"type\": \"Technology\"},\n {\"name\": \"DLP\", \"type\": \"Technology\"},\n {\"name\": \"Digital Light Processing\", \"type\": \"Technology\"},\n {\"name\": \"3SP\", \"type\": \"Technology\"},\n {\"name\": \"Scan, Spin, & Selectively Photocure\", \"type\": \"Technology\"},\n {\"name\": \"Powder Bed Fusion\", \"type\": \"AM Process\"},\n {\"name\": \"SLS\", \"type\": \"Technology\"},\n {\"name\": \"Selective Laser Sintering\", \"type\": \"Technology\"},\n {\"name\": \"DMLS\", \"type\": \"Technology\"},\n {\"name\": \"Direct Metal Laser Sintering\", \"type\": \"Technology\"},\n {\"name\": \"EBM\", \"type\": \"Technology\"},\n {\"name\": \"Electron Beam Melting\", \"type\": \"Technology\"},\n {\"name\": \"SHS\", \"type\": \"Technology\"},\n {\"name\": \"Selective Heat Sintering\", \"type\": \"Technology\"},\n {\"name\": \"LM\", \"type\": \"Technology\"},\n {\"name\": \"Laser Melting\", \"type\": \"Technology\"},\n {\"name\": \"SLM\", \"type\": \"Technology\"},\n {\"name\": \"Binder Jetting\", \"type\": \"AM Process\"},\n {\"name\": \"BJ\", \"type\": \"Abbreviation\"},\n {\"name\": \"CJP\", \"type\": \"Technology\"},\n {\"name\": \"ColorJet Printing\", \"type\": 
\"Technology\"},\n {\"name\": \"PP\", \"type\": \"Technology\"},\n {\"name\": \"Plaster-based 3D Printing\", \"type\": \"Technology\"},\n {\"name\": \"Sheet Lamination\", \"type\": \"AM Process\"},\n {\"name\": \"UC\", \"type\": \"Technology\"},\n {\"name\": \"Ultrasonic Consolidation\", \"type\": \"Technology\"},\n {\"name\": \"LOM\", \"type\": \"Technology\"},\n {\"name\": \"Laminated Object Manufacturing\", \"type\": \"Technology\"},\n {\"name\": \"Directed Energy Deposition\", \"type\": \"AM Process\"},\n {\"name\": \"LMD\", \"type\": \"Technology\"},\n {\"name\": \"Laser Metal Deposition\", \"type\": \"Technology\"},\n {\"name\": \"Material Jetting\", \"type\": \"AM Process\"},\n {\"name\": \"MJ\", \"type\": \"Abbreviation\"},\n {\"name\": \"MJP\", \"type\": \"Technology\"},\n {\"name\": \"PJ\", \"type\": \"Abbreviation\"},\n {\"name\": \"PolyJet\", \"type\": \"Technology\"},\n {\"name\": \"MultiJet Printing\", \"type\": \"Technology\"},\n {\"name\": \"Photopolymer Jetting\", \"type\": \"Technology\"}\n ],\n \"edges\": [\n {\"subject\": \"ASTM\", \"predicate\": \"defines\", \"object\": \"Additive Manufacturing\"},\n {\"subject\": \"American Society for Testing Materials\", \"predicate\": \"is also known as\", \"object\": \"ASTM\"},\n {\"subject\": \"Material Extrusion\", \"predicate\": \"is an AM process defined by\", \"object\": \"ASTM\"},\n {\"subject\": \"FDM\", \"predicate\": \"is a technology within\", \"object\": \"Material Extrusion\"},\n {\"subject\": \"Fused Deposition Modeling\", \"predicate\": \"is also known as\", \"object\": \"FDM\"},\n {\"subject\": \"FFF\", \"predicate\": \"is also known as\", \"object\": \"FDM\"},\n {\"subject\": \"Fused Filament Fabrication\", \"predicate\": \"is also known as\", \"object\": \"FDM\"},\n {\"subject\": \"Vat Photopolymerization\", \"predicate\": \"is an AM process defined by\", \"object\": \"ASTM\"},\n {\"subject\": \"SL\", \"predicate\": \"is a technology within\", \"object\": \"Vat Photopolymerization\"},\n 
{\"subject\": \"SLA\", \"predicate\": \"is a technology within\", \"object\": \"Vat Photopolymerization\"},\n {\"subject\": \"Stereolithography\", \"predicate\": \"is also known as\", \"object\": \"SL\"},\n {\"subject\": \"Stereolithography\", \"predicate\": \"is also known as\", \"object\": \"SLA\"},\n {\"subject\": \"DLP\", \"predicate\": \"is a technology within\", \"object\": \"Vat Photopolymerization\"},\n {\"subject\": \"Digital Light Processing\", \"",
"error": null
}
},
{
"source": "04_ALL_Materials and Their Properties_v3.pptx",
"size_bucket": "large",
"doc_chars_original": 25692,
"doc_chars_sent": 8000,
"truncated": true,
"condition_a": {
"input_tokens": 2237,
"output_tokens": 2692,
"latency_s": 13.67,
"entity_count": 57,
"edge_count": 53,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Stratasys\", \"type\": \"Company\"},\n {\"name\": \"PolyJet\", \"type\": \"3D Printing Technology\"},\n {\"name\": \"FDM\", \"type\": \"3D Printing Technology\"},\n {\"name\": \"PLA\", \"type\": \"FDM Material\"},\n {\"name\": \"ABS-PLUS\", \"type\": \"FDM Material\"},\n {\"name\": \"ABSM30\", \"type\": \"FDM Material\"},\n {\"name\": \"ABSI\", \"type\": \"FDM Material\"},\n {\"name\": \"ABS M30I\", \"type\": \"FDM Material\"},\n {\"name\": \"ABS-ESD7\", \"type\": \"FDM Material\"},\n {\"name\": \"ASA\", \"type\": \"FDM Material\"},\n {\"name\": \"NYLON 12\", \"type\": \"FDM Material\"},\n {\"name\": \"NYLON 6\", \"type\": \"FDM Material\"},\n {\"name\": \"PC-ABS\", \"type\": \"FDM Material\"},\n {\"name\": \"PC-ISO\", \"type\": \"FDM Material\"},\n {\"name\": \"ULTEM 9085\", \"type\": \"FDM Material\"},\n {\"name\": \"ULTEM 1010\", \"type\": \"FDM Material\"},\n {\"name\": \"Veros\", \"type\": \"PolyJet Material\"},\n {\"name\": \"Rigur\", \"type\": \"PolyJet Material\"},\n {\"name\": \"Durus\", \"type\": \"PolyJet Material\"},\n {\"name\": \"Tango\", \"type\": \"PolyJet Material\"},\n {\"name\": \"Agilus\", \"type\": \"PolyJet Material\"},\n {\"name\": \"Digital ABS\", \"type\": \"PolyJet Material\"},\n {\"name\": \"Thermoplastics\", \"type\": \"Material Category\"},\n {\"name\": \"SR support structures\", \"type\": \"Support Structure Type\"},\n {\"name\": \"BASS support structures\", \"type\": \"Support Structure Type\"},\n {\"name\": \"Thermal Resistance\", \"type\": \"ASTM Testing Standard\"},\n {\"name\": \"Tensile Strength\", \"type\": \"ASTM Testing Standard\"},\n {\"name\": \"Flexural Strength\", \"type\": \"ASTM Testing Standard\"},\n {\"name\": \"IZOD Impact\", \"type\": \"ASTM Testing Standard\"},\n {\"name\": \"Coefficient of Thermal Expansion\", \"type\": \"ASTM Testing Standard\"},\n {\"name\": \"Electrical Properties\", \"type\": \"ASTM Testing Standard\"},\n {\"name\": \"Water 
Absorption\", \"type\": \"ASTM Testing Standard\"},\n {\"name\": \"Shore Hardness\", \"type\": \"ASTM Testing Standard\"},\n {\"name\": \"Heat Deflection Temperature\", \"type\": \"Material Property\"},\n {\"name\": \"Tensile Modulus\", \"type\": \"Material Property\"},\n {\"name\": \"Elongation to Break\", \"type\": \"Material Property\"},\n {\"name\": \"Toughness\", \"type\": \"Material Property\"},\n {\"name\": \"Flexural Modulus\", \"type\": \"Material Property\"},\n {\"name\": \"Static Dissipative\", \"type\": \"Electrical Property\"},\n {\"name\": \"Surface Resistance\", \"type\": \"Electrical Property\"},\n {\"name\": \"Volume Resistance\", \"type\": \"Electrical Property\"},\n {\"name\": \"Shore A\", \"type\": \"Hardness Scale\"},\n {\"name\": \"Shore D\", \"type\": \"Hardness Scale\"},\n {\"name\": \"Materials Data Sheet\", \"type\": \"Document Type\"},\n {\"name\": \"Safety Data Sheet\", \"type\": \"Document Type\"},\n {\"name\": \"SDS\", \"type\": \"Document Type\"},\n {\"name\": \"REACH\", \"type\": \"Regulation\"},\n {\"name\": \"Global Harmonized System\", \"type\": \"Standard\"},\n {\"name\": \"GHS\", \"type\": \"Standard\"},\n {\"name\": \"Rigid Opaque\", \"type\": \"PolyJet Material Family\"},\n {\"name\": \"Rigid Transparent\", \"type\": \"PolyJet Material Family\"},\n {\"name\": \"Simulated Polypropylene Family\", \"type\": \"PolyJet Material Family\"},\n {\"name\": \"Flexible Rubber-like Family\", \"type\": \"PolyJet Material Family\"},\n {\"name\": \"Digital Material\", \"type\": \"PolyJet Material Family\"},\n {\"name\": \"Standard FDM Materials\", \"type\": \"FDM Material Category\"},\n {\"name\": \"Engineering FDM Materials\", \"type\": \"FDM Material Category\"},\n {\"name\": \"High Performance FDM Materials\", \"type\": \"FDM Material Category\"}\n ],\n \"edges\": [\n {\"subject\": \"Stratasys\", \"predicate\": \"manufactures\", \"object\": \"PolyJet\"},\n {\"subject\": \"Stratasys\", \"predicate\": \"manufactures\", \"object\": 
\"FDM\"},\n {\"subject\": \"PolyJet\", \"predicate\": \"is a\", \"object\": \"3D Printing Technology\"},\n {\"subject\": \"FDM\", \"predicate\": \"is a\", \"object\": \"3D Printing Technology\"},\n {\"subject\": \"PLA\", \"predicate\": \"is a\", \"object\": \"Standard FDM Materials\"},\n {\"subject\": \"ABS-PLUS\", \"predicate\": \"is a\", \"object\": \"Standard FDM Materials\"},\n {\"subject\": \"ABSM30\", \"predicate\": \"is a\", \"object\": \"Stand",
"error": null
},
"condition_b": {
"skipped": "local_model_failed",
"local_error": "HTTPConnectionPool(host='localhost', port=11434): Read timed out. (read timeout=120)",
"local_latency_s": 120.1
}
},
{
"source": "05_Entering the third dimension.docx",
"size_bucket": "medium",
"doc_chars_original": 2175,
"doc_chars_sent": 2175,
"truncated": false,
"condition_a": {
"input_tokens": 672,
"output_tokens": 998,
"latency_s": 5.16,
"entity_count": 22,
"edge_count": 22,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Rhino\", \"type\": \"Software\"},\n {\"name\": \"Surfaces\", \"type\": \"Geometric Component\"},\n {\"name\": \"NURBS objects\", \"type\": \"Geometric Object\"},\n {\"name\": \"Fabric\", \"type\": \"Analogy\"},\n {\"name\": \"Edge curves\", \"type\": \"Geometric Feature\"},\n {\"name\": \"Curves\", \"type\": \"Geometric Element\"},\n {\"name\": \"Seams\", \"type\": \"Geometric Feature\"},\n {\"name\": \"Border\", \"type\": \"Geometric Feature\"},\n {\"name\": \"Open cylinder\", \"type\": \"Surface Type\"},\n {\"name\": \"Torus\", \"type\": \"Surface Type\"},\n {\"name\": \"Isoparametric curves\", \"type\": \"Geometric Feature\"},\n {\"name\": \"Isocurves\", \"type\": \"Geometric Feature\"},\n {\"name\": \"U curves\", \"type\": \"Curve Direction\"},\n {\"name\": \"V curves\", \"type\": \"Curve Direction\"},\n {\"name\": \"Normal direction\", \"type\": \"Surface Property\"},\n {\"name\": \"Construction plane\", \"type\": \"Reference System\"},\n {\"name\": \"X and Y\", \"type\": \"Coordinate System\"},\n {\"name\": \"Rhino Options\", \"type\": \"Software Menu\"},\n {\"name\": \"Display Modes\", \"type\": \"Software Setting\"},\n {\"name\": \"Shaded\", \"type\": \"Display Mode\"},\n {\"name\": \"Back face settings\", \"type\": \"Software Setting\"},\n {\"name\": \"Lightweight extrusions\", \"type\": \"Surface Type\"}\n ],\n \"edges\": [\n {\"subject\": \"Surfaces\", \"predicate\": \"are the building blocks for\", \"object\": \"NURBS objects\"},\n {\"subject\": \"Surfaces\", \"predicate\": \"have\", \"object\": \"rectangular organization\"},\n {\"subject\": \"Surfaces\", \"predicate\": \"are compared to\", \"object\": \"Fabric\"},\n {\"subject\": \"Edge curves\", \"predicate\": \"describe the extent of\", \"object\": \"Surfaces\"},\n {\"subject\": \"Edge curves\", \"predicate\": \"are visually represented by\", \"object\": \"dark/thick lines\"},\n {\"subject\": \"Surfaces\", \"predicate\": \"can only be 
joined at\", \"object\": \"Edge curves\"},\n {\"subject\": \"Curves\", \"predicate\": \"can only be joined at\", \"object\": \"end points\"},\n {\"subject\": \"Seams\", \"predicate\": \"are formed when\", \"object\": \"Edge curves are joined\"},\n {\"subject\": \"Border\", \"predicate\": \"consists of\", \"object\": \"un-joined edges\"},\n {\"subject\": \"Surfaces\", \"predicate\": \"can be\", \"object\": \"open or closed\"},\n {\"subject\": \"Open cylinder\", \"predicate\": \"is closed in\", \"object\": \"one direction\"},\n {\"subject\": \"Torus\", \"predicate\": \"is closed in\", \"object\": \"both directions\"},\n {\"subject\": \"Isocurves\", \"predicate\": \"are visual aids for\", \"object\": \"Surfaces\"},\n {\"subject\": \"Isocurves\", \"predicate\": \"are lighter than\", \"object\": \"Edge curves\"},\n {\"subject\": \"Surfaces\", \"predicate\": \"have at least\", \"object\": \"two isocurves\"},\n {\"subject\": \"U curves\", \"predicate\": \"run in one direction\", \"object\": \"Surfaces\"},\n {\"subject\": \"V curves\", \"predicate\": \"run perpendicular to\", \"object\": \"U curves\"},\n {\"subject\": \"U and V\", \"predicate\": \"are used instead of\", \"object\": \"X and Y\"},\n {\"subject\": \"Surfaces\", \"predicate\": \"have\", \"object\": \"Normal direction\"},\n {\"subject\": \"Normal direction\", \"predicate\": \"indicates\", \"object\": \"inside or outside\"},\n {\"subject\": \"Display Modes\", \"predicate\": \"can be accessed through\", \"object\": \"Rhino Options\"},\n {\"subject\": \"Back face settings\", \"predicate\": \"is a feature of\", \"object\": \"Display Modes\"}\n ]\n}\n```",
"error": null
},
"condition_b": {
"local_latency_s": 17.22,
"local_candidates": [
"Rhino",
"NURBS objects",
"surfaces",
"Edge curves",
"Seams",
"Border",
"cylinder",
"torus",
"Isoparametric curves",
"Isocurves",
"U curves",
"V curves"
],
"local_raw": "{\n \"candidates\": [\"Rhino\", \"NURBS objects\", \"surfaces\", \"Edge curves\", \"Seams\", \"Border\", \"cylinder\", \"torus\", \"Isoparametric curves\", \"Isocurves\", \"U curves\", \"V curves\"]\n}",
"api_input_tokens": 767,
"api_output_tokens": 792,
"api_latency_s": 3.85,
"entity_count": 14,
"edge_count": 20,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Rhino\", \"type\": \"Software\"},\n {\"name\": \"NURBS objects\", \"type\": \"Geometric Object\"},\n {\"name\": \"surfaces\", \"type\": \"Geometric Object\"},\n {\"name\": \"Edge curves\", \"type\": \"Geometric Feature\"},\n {\"name\": \"Seams\", \"type\": \"Geometric Feature\"},\n {\"name\": \"Border\", \"type\": \"Geometric Feature\"},\n {\"name\": \"cylinder\", \"type\": \"Geometric Shape\"},\n {\"name\": \"torus\", \"type\": \"Geometric Shape\"},\n {\"name\": \"Isoparametric curves\", \"type\": \"Geometric Feature\"},\n {\"name\": \"Isocurves\", \"type\": \"Geometric Feature\"},\n {\"name\": \"U curves\", \"type\": \"Geometric Feature\"},\n {\"name\": \"V curves\", \"type\": \"Geometric Feature\"},\n {\"name\": \"Normal direction\", \"type\": \"Geometric Property\"},\n {\"name\": \"rectangular organization\", \"type\": \"Geometric Property\"}\n ],\n \"edges\": [\n {\"subject\": \"Rhino\", \"predicate\": \"uses\", \"object\": \"NURBS objects\"},\n {\"subject\": \"NURBS objects\", \"predicate\": \"are composed of\", \"object\": \"surfaces\"},\n {\"subject\": \"surfaces\", \"predicate\": \"have\", \"object\": \"rectangular organization\"},\n {\"subject\": \"surfaces\", \"predicate\": \"are bounded by\", \"object\": \"Edge curves\"},\n {\"subject\": \"Edge curves\", \"predicate\": \"are visually represented as\", \"object\": \"dark/thick lines\"},\n {\"subject\": \"surfaces\", \"predicate\": \"can be joined at\", \"object\": \"Edge curves\"},\n {\"subject\": \"Edge curves\", \"predicate\": \"become\", \"object\": \"Seams\"},\n {\"subject\": \"Seams\", \"predicate\": \"are\", \"object\": \"joined edges\"},\n {\"subject\": \"Border\", \"predicate\": \"consists of\", \"object\": \"un-joined edges\"},\n {\"subject\": \"cylinder\", \"predicate\": \"is\", \"object\": \"open or closed\"},\n {\"subject\": \"torus\", \"predicate\": \"is closed in\", \"object\": \"both directions\"},\n {\"subject\": \"surfaces\", 
\"predicate\": \"contain\", \"object\": \"Isoparametric curves\"},\n {\"subject\": \"Isoparametric curves\", \"predicate\": \"are also called\", \"object\": \"Isocurves\"},\n {\"subject\": \"Isocurves\", \"predicate\": \"are perpendicular to\", \"object\": \"one another\"},\n {\"subject\": \"Isocurves\", \"predicate\": \"are lighter than\", \"object\": \"Edge curves\"},\n {\"subject\": \"Isocurves\", \"predicate\": \"consist of\", \"object\": \"U curves\"},\n {\"subject\": \"Isocurves\", \"predicate\": \"consist of\", \"object\": \"V curves\"},\n {\"subject\": \"U curves\", \"predicate\": \"run perpendicular to\", \"object\": \"V curves\"},\n {\"subject\": \"surfaces\", \"predicate\": \"have\", \"object\": \"Normal direction\"},\n {\"subject\": \"Normal direction\", \"predicate\": \"indicates\", \"object\": \"inside or outside\"}\n ]\n}\n```",
"error": null
}
},
{
"source": "05_PPT_ALL_Machine Technology and Specifications_v3.pptx",
"size_bucket": "large",
"doc_chars_original": 29095,
"doc_chars_sent": 8000,
"truncated": true,
"condition_a": {
"input_tokens": 2207,
"output_tokens": 1840,
"latency_s": 9.13,
"entity_count": 30,
"edge_count": 46,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Stratasys\", \"type\": \"Company\"},\n {\"name\": \"FDM\", \"type\": \"Technology\"},\n {\"name\": \"PolyJet\", \"type\": \"Technology\"},\n {\"name\": \"F123 series\", \"type\": \"Printer\"},\n {\"name\": \"F170\", \"type\": \"Printer\"},\n {\"name\": \"F270\", \"type\": \"Printer\"},\n {\"name\": \"F370\", \"type\": \"Printer\"},\n {\"name\": \"Fortus 450MC\", \"type\": \"Printer\"},\n {\"name\": \"Fortus 380CF\", \"type\": \"Printer\"},\n {\"name\": \"J750\", \"type\": \"Printer\"},\n {\"name\": \"ABS\", \"type\": \"Material\"},\n {\"name\": \"PLA\", \"type\": \"Material\"},\n {\"name\": \"ASA\", \"type\": \"Material\"},\n {\"name\": \"PC-ABS\", \"type\": \"Material\"},\n {\"name\": \"QSR Support\", \"type\": \"Material\"},\n {\"name\": \"GrabCAD Print\", \"type\": \"Software\"},\n {\"name\": \"Insight\", \"type\": \"Software\"},\n {\"name\": \"SCA-3600 Cleaning Station\", \"type\": \"Equipment\"},\n {\"name\": \"SCA 1200 Cleaning Station\", \"type\": \"Equipment\"},\n {\"name\": \"Infinite-Build\", \"type\": \"Technology Feature\"},\n {\"name\": \"Continuous Build\", \"type\": \"Technology Feature\"},\n {\"name\": \"Concept Verification\", \"type\": \"Use Case\"},\n {\"name\": \"Design Validation\", \"type\": \"Use Case\"},\n {\"name\": \"Functional Performance\", \"type\": \"Use Case\"},\n {\"name\": \"Jigs & Fixtures\", \"type\": \"Use Case\"},\n {\"name\": \"Rapid Prototyping\", \"type\": \"Use Case\"},\n {\"name\": \"Motocross helmet air vent\", \"type\": \"Application Example\"},\n {\"name\": \"Smart home switch housing\", \"type\": \"Application Example\"},\n {\"name\": \"Mechanical iris\", \"type\": \"Application Example\"},\n {\"name\": \"1988\", \"type\": \"Year\"}\n ],\n \"edges\": [\n {\"subject\": \"Stratasys\", \"predicate\": \"manufactures\", \"object\": \"FDM\"},\n {\"subject\": \"Stratasys\", \"predicate\": \"manufactures\", \"object\": \"PolyJet\"},\n {\"subject\": \"Stratasys\", 
\"predicate\": \"manufactures\", \"object\": \"F123 series\"},\n {\"subject\": \"Stratasys\", \"predicate\": \"manufactures\", \"object\": \"F170\"},\n {\"subject\": \"Stratasys\", \"predicate\": \"manufactures\", \"object\": \"F270\"},\n {\"subject\": \"Stratasys\", \"predicate\": \"manufactures\", \"object\": \"F370\"},\n {\"subject\": \"Stratasys\", \"predicate\": \"manufactures\", \"object\": \"Fortus 450MC\"},\n {\"subject\": \"Stratasys\", \"predicate\": \"manufactures\", \"object\": \"Fortus 380CF\"},\n {\"subject\": \"Stratasys\", \"predicate\": \"manufactures\", \"object\": \"J750\"},\n {\"subject\": \"F123 series\", \"predicate\": \"includes\", \"object\": \"F170\"},\n {\"subject\": \"F123 series\", \"predicate\": \"includes\", \"object\": \"F270\"},\n {\"subject\": \"F123 series\", \"predicate\": \"includes\", \"object\": \"F370\"},\n {\"subject\": \"F123 series\", \"predicate\": \"uses technology\", \"object\": \"FDM\"},\n {\"subject\": \"F170\", \"predicate\": \"has build size\", \"object\": \"10x10x10 inches\"},\n {\"subject\": \"F270\", \"predicate\": \"has build size\", \"object\": \"12x10x12 inches\"},\n {\"subject\": \"F370\", \"predicate\": \"has build size\", \"object\": \"14x10x14 inches\"},\n {\"subject\": \"F123 series\", \"predicate\": \"supports materials\", \"object\": \"ABS\"},\n {\"subject\": \"F123 series\", \"predicate\": \"supports materials\", \"object\": \"PLA\"},\n {\"subject\": \"F123 series\", \"predicate\": \"supports materials\", \"object\": \"ASA\"},\n {\"subject\": \"F370\", \"predicate\": \"supports materials\", \"object\": \"PC-ABS\"},\n {\"subject\": \"F123 series\", \"predicate\": \"uses software\", \"object\": \"GrabCAD Print\"},\n {\"subject\": \"J750\", \"predicate\": \"compatible with\", \"object\": \"GrabCAD Print\"},\n {\"subject\": \"F123 series\", \"predicate\": \"uses software\", \"object\": \"Insight\"},\n {\"subject\": \"F123 series\", \"predicate\": \"compatible with\", \"object\": \"SCA-3600 Cleaning 
Station\"},\n {\"subject\": \"F123 series\", \"predicate\": \"compatible with\", \"object\": \"SCA 1200 Cleaning Station\"},\n {\"subject\": \"FDM\", \"predicate\": \"founded by\", \"object\": \"Stratasys\"},\n {\"subject\": \"FDM\", \"predicate\": \"founded in\", \"object\": \"1988\"},\n {\"subject\": \"PolyJet\", \"predicate\": \"evolved from\", \"object\": \"1 material to 6 materials\"},\n {\"subject\": \"F170\", \"predicate\": \"designed for\", \"object\": \"Schools and professionals\"},\n {\"sub",
"error": null
},
"condition_b": {
"local_latency_s": 84.99,
"local_candidates": [
"Stratasys",
"FDM",
"PolyJet",
"YouTube",
"Industrial Manufacturing",
"Strategy",
"F123 series",
"Fortus 450MC",
"J750",
"GrabCAD Print",
"ABS",
"PLA",
"ASA",
"PC-ABS",
"QSR Support",
"SCA-3600 Cleaning Station",
"SCA 1200 Cleaning Station",
"STREAMLINED DESIGN-TO-3D PRINT"
],
"local_raw": "{\n \"candidates\": [\n \"Stratasys\",\n \"FDM\",\n \"PolyJet\",\n \"YouTube\",\n \"Industrial Manufacturing\",\n \"Strategy\",\n \"F123 series\",\n \"Fortus 450MC\",\n \"J750\",\n \"GrabCAD Print\",\n \"ABS\",\n \"PLA\",\n \"ASA\",\n \"PC-ABS\",\n \"QSR Support\",\n \"SCA-3600 Cleaning Station\",\n \"SCA 1200 Cleaning Station\",\n \"STREAMLINED DESIGN-TO-3D PRINT\"\n ]\n}",
"api_input_tokens": 2358,
"api_output_tokens": 1391,
"api_latency_s": 7.11,
"entity_count": 24,
"edge_count": 35,
"stop_reason": "end_turn",
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Stratasys\", \"type\": \"Company\"},\n {\"name\": \"FDM\", \"type\": \"Technology\"},\n {\"name\": \"PolyJet\", \"type\": \"Technology\"},\n {\"name\": \"YouTube\", \"type\": \"Platform\"},\n {\"name\": \"Industrial Manufacturing\", \"type\": \"Industry\"},\n {\"name\": \"Strategy\", \"type\": \"Concept\"},\n {\"name\": \"F123 series\", \"type\": \"Printer Model\"},\n {\"name\": \"F170\", \"type\": \"Printer Model\"},\n {\"name\": \"F270\", \"type\": \"Printer Model\"},\n {\"name\": \"F370\", \"type\": \"Printer Model\"},\n {\"name\": \"Fortus 450MC\", \"type\": \"Printer Model\"},\n {\"name\": \"Fortus 380CF\", \"type\": \"Printer Model\"},\n {\"name\": \"J750\", \"type\": \"Printer Model\"},\n {\"name\": \"GrabCAD Print\", \"type\": \"Software\"},\n {\"name\": \"ABS\", \"type\": \"Material\"},\n {\"name\": \"PLA\", \"type\": \"Material\"},\n {\"name\": \"ASA\", \"type\": \"Material\"},\n {\"name\": \"PC-ABS\", \"type\": \"Material\"},\n {\"name\": \"QSR Support\", \"type\": \"Support Material\"},\n {\"name\": \"SCA-3600 Cleaning Station\", \"type\": \"Equipment\"},\n {\"name\": \"SCA 1200 Cleaning Station\", \"type\": \"Equipment\"},\n {\"name\": \"Insight\", \"type\": \"Software\"},\n {\"name\": \"Infinite-Build\", \"type\": \"Technology Feature\"},\n {\"name\": \"Continuous Build\", \"type\": \"Technology Feature\"}\n ],\n \"edges\": [\n {\"subject\": \"Stratasys\", \"predicate\": \"manufactures\", \"object\": \"FDM\"},\n {\"subject\": \"Stratasys\", \"predicate\": \"manufactures\", \"object\": \"PolyJet\"},\n {\"subject\": \"Stratasys\", \"predicate\": \"manufactures\", \"object\": \"F123 series\"},\n {\"subject\": \"Stratasys\", \"predicate\": \"manufactures\", \"object\": \"Fortus 450MC\"},\n {\"subject\": \"Stratasys\", \"predicate\": \"manufactures\", \"object\": \"Fortus 380CF\"},\n {\"subject\": \"Stratasys\", \"predicate\": \"manufactures\", \"object\": \"J750\"},\n {\"subject\": \"F123 
series\", \"predicate\": \"includes models\", \"object\": \"F170\"},\n {\"subject\": \"F123 series\", \"predicate\": \"includes models\", \"object\": \"F270\"},\n {\"subject\": \"F123 series\", \"predicate\": \"includes models\", \"object\": \"F370\"},\n {\"subject\": \"F123 series\", \"predicate\": \"uses technology\", \"object\": \"FDM\"},\n {\"subject\": \"F170\", \"predicate\": \"supports materials\", \"object\": \"ABS\"},\n {\"subject\": \"F170\", \"predicate\": \"supports materials\", \"object\": \"PLA\"},\n {\"subject\": \"F170\", \"predicate\": \"supports materials\", \"object\": \"ASA\"},\n {\"subject\": \"F270\", \"predicate\": \"supports materials\", \"object\": \"ABS\"},\n {\"subject\": \"F270\", \"predicate\": \"supports materials\", \"object\": \"PLA\"},\n {\"subject\": \"F270\", \"predicate\": \"supports materials\", \"object\": \"ASA\"},\n {\"subject\": \"F370\", \"predicate\": \"supports materials\", \"object\": \"ABS\"},\n {\"subject\": \"F370\", \"predicate\": \"supports materials\", \"object\": \"PLA\"},\n {\"subject\": \"F370\", \"predicate\": \"supports materials\", \"object\": \"ASA\"},\n {\"subject\": \"F370\", \"predicate\": \"supports materials\", \"object\": \"PC-ABS\"},\n {\"subject\": \"F123 series\", \"predicate\": \"uses software\", \"object\": \"GrabCAD Print\"},\n {\"subject\": \"F123 series\", \"predicate\": \"compatible with cleaning station\", \"object\": \"SCA-3600 Cleaning Station\"},\n {\"subject\": \"F123 series\", \"predicate\": \"compatible with cleaning station\", \"object\": \"SCA 1200 Cleaning Station\"},\n {\"subject\": \"F123 series\", \"predicate\": \"supports\", \"object\": \"QSR Support\"},\n {\"subject\": \"J750\", \"predicate\": \"compatible with software\", \"object\": \"GrabCAD Print\"},\n {\"subject\": \"GrabCAD Print\", \"predicate\": \"connects to\", \"object\": \"Stratasys\"},\n {\"subject\": \"Insight\", \"predicate\": \"optimizes\", \"object\": \"F123 series\"},\n {\"subject\": \"FDM\", \"predicate\": 
\"founded by\", \"object\": \"Stratasys\"},\n {\"subject\": \"Industrial Manufacturing\", \"predicate\": \"requires\", \"object\": \"Strategy\"},\n {\"subject\": \"F123 series\", \"predicate\": \"used for\", \"object\": \"Rapid Prototyping\"},\n {\"subject\": \"F123 series\", \"predicate\": \"used for\", \"object\": \"Concept Verification\"},\n {\"subject\": \"F123 series\", \"predicate\": \"used for\", \"object\": \"Design Validation\"},\n {\"subject\": \"F123 series\", \"predicate\": \"used for\",",
|
||||
"error": null
|
||||
}
|
||||
},
|
||||
{
|
||||
"source": "06_Gumball.docx",
|
||||
"size_bucket": "medium",
|
||||
"doc_chars_original": 1980,
|
||||
"doc_chars_sent": 1980,
|
||||
"truncated": false,
|
||||
"condition_a": {
|
||||
"input_tokens": 652,
|
||||
"output_tokens": 1437,
|
||||
"latency_s": 6.71,
|
||||
"entity_count": 34,
|
||||
"edge_count": 29,
|
||||
"stop_reason": "end_turn",
|
||||
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Gumball\", \"type\": \"Tool\"},\n {\"name\": \"Widget\", \"type\": \"Component\"},\n {\"name\": \"Selected object\", \"type\": \"Entity\"},\n {\"name\": \"Move transformation\", \"type\": \"Transformation\"},\n {\"name\": \"Scale transformation\", \"type\": \"Transformation\"},\n {\"name\": \"Rotate transformation\", \"type\": \"Transformation\"},\n {\"name\": \"Gumball origin\", \"type\": \"Reference point\"},\n {\"name\": \"Gumball menu\", \"type\": \"Interface\"},\n {\"name\": \"White menu ball\", \"type\": \"UI element\"},\n {\"name\": \"Relocate Gumball\", \"type\": \"Menu option\"},\n {\"name\": \"Gumball center\", \"type\": \"Reference point\"},\n {\"name\": \"Reset Gumball\", \"type\": \"Menu option\"},\n {\"name\": \"Area centroid\", \"type\": \"Reference point\"},\n {\"name\": \"Align to CPlane\", \"type\": \"Menu option\"},\n {\"name\": \"Align to Object\", \"type\": \"Menu option\"},\n {\"name\": \"Align to World\", \"type\": \"Menu option\"},\n {\"name\": \"CPlane\", \"type\": \"Coordinate system\"},\n {\"name\": \"X axis\", \"type\": \"Axis\"},\n {\"name\": \"Y axis\", \"type\": \"Axis\"},\n {\"name\": \"Z axis\", \"type\": \"Axis\"},\n {\"name\": \"U curve\", \"type\": \"Object property\"},\n {\"name\": \"V curve\", \"type\": \"Object property\"},\n {\"name\": \"World grid system\", \"type\": \"Coordinate system\"},\n {\"name\": \"Snappy dragging\", \"type\": \"Dragging mode\"},\n {\"name\": \"Smooth dragging\", \"type\": \"Dragging mode\"},\n {\"name\": \"Osnaps\", \"type\": \"Feature\"},\n {\"name\": \"Drag strength\", \"type\": \"Parameter\"},\n {\"name\": \"Gumball elements\", \"type\": \"Component\"},\n {\"name\": \"Red color\", \"type\": \"Color\"},\n {\"name\": \"Green color\", \"type\": \"Color\"},\n {\"name\": \"Blue color\", \"type\": \"Color\"},\n {\"name\": \"Dotted lines\", \"type\": \"Visual element\"},\n {\"name\": \"Arrow\", \"type\": \"Visual element\"},\n {\"name\": 
\"Box\", \"type\": \"Visual element\"}\n ],\n \"edges\": [\n {\"subject\": \"Gumball\", \"predicate\": \"is a\", \"object\": \"Widget\"},\n {\"subject\": \"Gumball\", \"predicate\": \"is used to facilitate\", \"object\": \"Direct editing\"},\n {\"subject\": \"Gumball\", \"predicate\": \"provides\", \"object\": \"Move transformation\"},\n {\"subject\": \"Gumball\", \"predicate\": \"provides\", \"object\": \"Scale transformation\"},\n {\"subject\": \"Gumball\", \"predicate\": \"provides\", \"object\": \"Rotate transformation\"},\n {\"subject\": \"Gumball\", \"predicate\": \"operates around\", \"object\": \"Gumball origin\"},\n {\"subject\": \"Gumball menu\", \"predicate\": \"is accessed by\", \"object\": \"White menu ball\"},\n {\"subject\": \"Relocate Gumball\", \"predicate\": \"allows redefinition of\", \"object\": \"Gumball center\"},\n {\"subject\": \"Gumball center\", \"predicate\": \"is the origin point for\", \"object\": \"Scaling\"},\n {\"subject\": \"Gumball center\", \"predicate\": \"is the origin point for\", \"object\": \"Translating\"},\n {\"subject\": \"Gumball center\", \"predicate\": \"is the origin point for\", \"object\": \"Rotating\"},\n {\"subject\": \"Reset Gumball\", \"predicate\": \"places gumball at\", \"object\": \"Area centroid\"},\n {\"subject\": \"Align to CPlane\", \"predicate\": \"makes arrows point in direction of\", \"object\": \"X axis\"},\n {\"subject\": \"Align to CPlane\", \"predicate\": \"makes arrows point in direction of\", \"object\": \"Y axis\"},\n {\"subject\": \"Align to CPlane\", \"predicate\": \"makes arrows point in direction of\", \"object\": \"Z axis\"},\n {\"subject\": \"Align to Object\", \"predicate\": \"relates gumball to\", \"object\": \"U curve\"},\n {\"subject\": \"Align to Object\", \"predicate\": \"relates gumball to\", \"object\": \"V curve\"},\n {\"subject\": \"Align to World\", \"predicate\": \"relates gumball to\", \"object\": \"World grid system\"},\n {\"subject\": \"Smooth dragging\", \"predicate\": 
\"allows dragging without\", \"object\": \"Restraints\"},\n {\"subject\": \"Snappy dragging\", \"predicate\": \"is equivalent to\", \"object\": \"Osnaps\"},\n {\"subject\": \"Snappy dragging\", \"predicate\": \"causes object to snap to\", \"object\": \"End point\"},\n {\"subject\": \"Snappy dragging\", \"predicate\": \"causes object to snap to\", \"object\": \"Mid point\"},\n {\"subject\": \"Drag strength\", \"predicate\": \"dictates how\", \"object\": \"Mouse mov",
|
||||
"error": null
|
||||
},
|
||||
"condition_b": {
|
||||
"local_latency_s": 19.62,
|
||||
"local_candidates": [
|
||||
"Gumball",
|
||||
"Menu",
|
||||
"White menu ball",
|
||||
"Relocate Gumball",
|
||||
"Origin point",
|
||||
"Scaling",
|
||||
"Translating",
|
||||
"Rotating",
|
||||
"Reset Gumball",
|
||||
"Area centroid",
|
||||
"CPlane",
|
||||
"Object",
|
||||
"World",
|
||||
"Snappy draging",
|
||||
"Smooth draging",
|
||||
"Osnaps",
|
||||
"Illustraitor",
|
||||
"Drag Strength",
|
||||
"Red",
|
||||
"Green",
|
||||
"Blue"
|
||||
],
|
||||
"local_raw": "{\n \"candidates\": [\"Gumball\", \"Menu\", \"White menu ball\", \"Relocate Gumball\", \"Origin point\", \"Scaling\", \"Translating\", \"Rotating\", \"Reset Gumball\", \"Area centroid\", \"CPlane\", \"Object\", \"World\", \"Snappy draging\", \"Smooth draging\", \"Osnaps\", \"Illustraitor\", \"Drag Strength\", \"Red\", \"Green\", \"Blue\"]\n}",
|
||||
"api_input_tokens": 792,
|
||||
"api_output_tokens": 1050,
|
||||
"api_latency_s": 5.15,
|
||||
"entity_count": 24,
|
||||
"edge_count": 23,
|
||||
"stop_reason": "end_turn",
|
||||
"response_text": "```json\n{\n \"entities\": [\n {\"name\": \"Gumball\", \"type\": \"Widget\"},\n {\"name\": \"Menu\", \"type\": \"UI Component\"},\n {\"name\": \"White menu ball\", \"type\": \"UI Element\"},\n {\"name\": \"Relocate Gumball\", \"type\": \"Function\"},\n {\"name\": \"Origin point\", \"type\": \"Concept\"},\n {\"name\": \"Scaling\", \"type\": \"Transformation\"},\n {\"name\": \"Translating\", \"type\": \"Transformation\"},\n {\"name\": \"Rotating\", \"type\": \"Transformation\"},\n {\"name\": \"Reset Gumball\", \"type\": \"Function\"},\n {\"name\": \"Area centroid\", \"type\": \"Concept\"},\n {\"name\": \"CPlane\", \"type\": \"Alignment Option\"},\n {\"name\": \"Object\", \"type\": \"Alignment Option\"},\n {\"name\": \"World\", \"type\": \"Alignment Option\"},\n {\"name\": \"Snappy dragging\", \"type\": \"Dragging Mode\"},\n {\"name\": \"Smooth dragging\", \"type\": \"Dragging Mode\"},\n {\"name\": \"Osnaps\", \"type\": \"Feature\"},\n {\"name\": \"Drag Strength\", \"type\": \"Parameter\"},\n {\"name\": \"Red\", \"type\": \"Color\"},\n {\"name\": \"Green\", \"type\": \"Color\"},\n {\"name\": \"Blue\", \"type\": \"Color\"},\n {\"name\": \"X axis\", \"type\": \"Axis\"},\n {\"name\": \"Y axis\", \"type\": \"Axis\"},\n {\"name\": \"Z axis\", \"type\": \"Axis\"},\n {\"name\": \"Illustrator\", \"type\": \"Software\"}\n ],\n \"edges\": [\n {\"subject\": \"Gumball\", \"predicate\": \"is a widget used to facilitate\", \"object\": \"direct editing\"},\n {\"subject\": \"Gumball\", \"predicate\": \"provides\", \"object\": \"Scaling\"},\n {\"subject\": \"Gumball\", \"predicate\": \"provides\", \"object\": \"Translating\"},\n {\"subject\": \"Gumball\", \"predicate\": \"provides\", \"object\": \"Rotating\"},\n {\"subject\": \"White menu ball\", \"predicate\": \"opens\", \"object\": \"Menu\"},\n {\"subject\": \"Relocate Gumball\", \"predicate\": \"redefines\", \"object\": \"Origin point\"},\n {\"subject\": \"Origin point\", \"predicate\": \"is the center for\", 
\"object\": \"Scaling\"},\n {\"subject\": \"Origin point\", \"predicate\": \"is the center for\", \"object\": \"Translating\"},\n {\"subject\": \"Origin point\", \"predicate\": \"is the center for\", \"object\": \"Rotating\"},\n {\"subject\": \"Reset Gumball\", \"predicate\": \"places gumball at\", \"object\": \"Area centroid\"},\n {\"subject\": \"Gumball\", \"predicate\": \"can align to\", \"object\": \"CPlane\"},\n {\"subject\": \"Gumball\", \"predicate\": \"can align to\", \"object\": \"Object\"},\n {\"subject\": \"Gumball\", \"predicate\": \"can align to\", \"object\": \"World\"},\n {\"subject\": \"CPlane alignment\", \"predicate\": \"points arrows in direction of\", \"object\": \"X axis\"},\n {\"subject\": \"CPlane alignment\", \"predicate\": \"points arrows in direction of\", \"object\": \"Y axis\"},\n {\"subject\": \"CPlane alignment\", \"predicate\": \"points arrows in direction of\", \"object\": \"Z axis\"},\n {\"subject\": \"Smooth dragging\", \"predicate\": \"allows dragging without\", \"object\": \"restraints\"},\n {\"subject\": \"Snappy dragging\", \"predicate\": \"is equivalent to turning on\", \"object\": \"Osnaps\"},\n {\"subject\": \"Drag Strength\", \"predicate\": \"dictates how mouse movement affects\", \"object\": \"object\"},\n {\"subject\": \"Red\", \"predicate\": \"is assigned to\", \"object\": \"X axis\"},\n {\"subject\": \"Green\", \"predicate\": \"is assigned to\", \"object\": \"Y axis\"},\n {\"subject\": \"Blue\", \"predicate\": \"is assigned to\", \"object\": \"Z axis\"},\n {\"subject\": \"Blue element\", \"predicate\": \"alters object in\", \"object\": \"Z axis\"}\n ]\n}\n```",
|
||||
"error": null
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,332 @@
|
||||
[
|
||||
{
|
||||
"name": "ChatGPT: Provide DXF file",
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"bucket": "low",
|
||||
"delta_preds": 4,
|
||||
"delta_edges": 11,
|
||||
"prod_preds": 5,
|
||||
"cascade_preds": 9
|
||||
},
|
||||
{
|
||||
"name": "Claude: Internship agreement writing help",
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"bucket": "mid",
|
||||
"delta_preds": 0,
|
||||
"delta_edges": 1,
|
||||
"prod_preds": 6,
|
||||
"cascade_preds": 6
|
||||
},
|
||||
{
|
||||
"name": "Claude: Interview presentation research and preparation",
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "i would say that this is edu and prof",
|
||||
"bucket": "mid",
|
||||
"delta_preds": -4,
|
||||
"delta_edges": 0,
|
||||
"prod_preds": 9,
|
||||
"cascade_preds": 5
|
||||
},
|
||||
{
|
||||
"name": "Circuits II.pptx",
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"bucket": "document",
|
||||
"delta_preds": 2,
|
||||
"delta_edges": 5,
|
||||
"prod_preds": 5,
|
||||
"cascade_preds": 7
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Scholarship Recommendation Letter Tips",
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"bucket": "low",
|
||||
"delta_preds": -1,
|
||||
"delta_edges": -1,
|
||||
"prod_preds": 2,
|
||||
"cascade_preds": 1
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: SEC coaches with OSU ties",
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"bucket": "high",
|
||||
"delta_preds": -1,
|
||||
"delta_edges": -1,
|
||||
"prod_preds": 5,
|
||||
"cascade_preds": 4
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Tulsa Concept Album Guide",
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"bucket": "high",
|
||||
"delta_preds": -6,
|
||||
"delta_edges": -3,
|
||||
"prod_preds": 10,
|
||||
"cascade_preds": 4
|
||||
},
|
||||
{
|
||||
"name": "Claude: I filling out my annual report...",
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "edu and prof",
|
||||
"bucket": "mid",
|
||||
"delta_preds": 1,
|
||||
"delta_edges": 3,
|
||||
"prod_preds": 6,
|
||||
"cascade_preds": 7
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Remington 700 5R Gen 1",
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "edu and personal",
|
||||
"bucket": "mid",
|
||||
"delta_preds": 1,
|
||||
"delta_edges": 0,
|
||||
"prod_preds": 7,
|
||||
"cascade_preds": 8
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Research Statement Restructure",
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "edu and prof and personal maybe depending on how you see a job search",
|
||||
"bucket": "mid",
|
||||
"delta_preds": -2,
|
||||
"delta_edges": 3,
|
||||
"prod_preds": 4,
|
||||
"cascade_preds": 2
|
||||
},
|
||||
{
|
||||
"name": "Claude: SUNY faculty conflict of interest policies",
|
||||
"binary": "multi",
|
||||
"score": 2,
|
||||
"note": "this is edu also prof but might also be personal becuase its work outside of my normal job",
|
||||
"bucket": "mid",
|
||||
"delta_preds": -4,
|
||||
"delta_edges": -7,
|
||||
"prod_preds": 7,
|
||||
"cascade_preds": 3
|
||||
},
|
||||
{
|
||||
"name": "Nic Oconnor Field Work F2023 Syllabus.pdf",
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"bucket": "document",
|
||||
"delta_preds": 9,
|
||||
"delta_edges": 7,
|
||||
"prod_preds": 5,
|
||||
"cascade_preds": 14
|
||||
},
|
||||
{
|
||||
"name": "New Mexico Cover Letter.docx",
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "and application letter to an EDU job fits in two domains",
|
||||
"bucket": "document",
|
||||
"delta_preds": 3,
|
||||
"delta_edges": 8,
|
||||
"prod_preds": 7,
|
||||
"cascade_preds": 10
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Testing Vector Alignment",
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"bucket": "low",
|
||||
"delta_preds": 0,
|
||||
"delta_edges": 0,
|
||||
"prod_preds": 3,
|
||||
"cascade_preds": 3
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Sink Sprayer Fitting Name",
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"bucket": "low",
|
||||
"delta_preds": 0,
|
||||
"delta_edges": -1,
|
||||
"prod_preds": 3,
|
||||
"cascade_preds": 3
|
||||
},
|
||||
{
|
||||
"name": "Polar Coordinates.pptx",
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "this could be both edu and professional",
|
||||
"bucket": "document",
|
||||
"delta_preds": 0,
|
||||
"delta_edges": 0,
|
||||
"prod_preds": 8,
|
||||
"cascade_preds": 8
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Rhino 3D object flow",
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "this is the same its edu but part of my profession",
|
||||
"bucket": "high",
|
||||
"delta_preds": -5,
|
||||
"delta_edges": -3,
|
||||
"prod_preds": 16,
|
||||
"cascade_preds": 11
|
||||
},
|
||||
{
|
||||
"name": "Aaron AI: So, I've been working on the RNAI project, and the way I've ...",
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "technical but apersonal project",
|
||||
"bucket": "mid",
|
||||
"delta_preds": 1,
|
||||
"delta_edges": 2,
|
||||
"prod_preds": 5,
|
||||
"cascade_preds": 6
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Regex for inserting letters",
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": "this is a technical question",
|
||||
"bucket": "low",
|
||||
"delta_preds": 0,
|
||||
"delta_edges": 0,
|
||||
"prod_preds": 4,
|
||||
"cascade_preds": 4
|
||||
},
|
||||
{
|
||||
"name": "Claude: Evaluating tenure prospects at R1 universities",
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "this to me is EDU and professional",
|
||||
"bucket": "high",
|
||||
"delta_preds": 12,
|
||||
"delta_edges": 31,
|
||||
"prod_preds": 3,
|
||||
"cascade_preds": 15
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Title: User request summary.",
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "this is agina is edu and prfoessional",
|
||||
"bucket": "low",
|
||||
"delta_preds": -1,
|
||||
"delta_edges": -1,
|
||||
"prod_preds": 3,
|
||||
"cascade_preds": 2
|
||||
},
|
||||
{
|
||||
"name": "University of North Texas Cover letter.pdf",
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "edu and professional, im applying for a job i forgot i applied for",
|
||||
"bucket": "document",
|
||||
"delta_preds": -2,
|
||||
"delta_edges": -1,
|
||||
"prod_preds": 10,
|
||||
"cascade_preds": 8
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Resume formatting and review",
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "personal and professional, im trying to get anotehr job",
|
||||
"bucket": "high",
|
||||
"delta_preds": 1,
|
||||
"delta_edges": 3,
|
||||
"prod_preds": 8,
|
||||
"cascade_preds": 9
|
||||
},
|
||||
{
|
||||
"name": "Nic Oconnor Ind Study S2024 Syllabus.docx",
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"bucket": "document",
|
||||
"delta_preds": 2,
|
||||
"delta_edges": 1,
|
||||
"prod_preds": 6,
|
||||
"cascade_preds": 8
|
||||
},
|
||||
{
|
||||
"name": "Claude: Lubbock on everything album lyrics",
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": "personal too tho",
|
||||
"bucket": "high",
|
||||
"delta_preds": -7,
|
||||
"delta_edges": -12,
|
||||
"prod_preds": 11,
|
||||
"cascade_preds": 4
|
||||
},
|
||||
{
|
||||
"name": "Claude: Bonding ASA 3D printed parts",
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "edu and professional",
|
||||
"bucket": "mid",
|
||||
"delta_preds": -1,
|
||||
"delta_edges": 1,
|
||||
"prod_preds": 5,
|
||||
"cascade_preds": 4
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Respect Individual Interests for Christmas",
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": "personal to me not edu",
|
||||
"bucket": "low",
|
||||
"delta_preds": -2,
|
||||
"delta_edges": -2,
|
||||
"prod_preds": 3,
|
||||
"cascade_preds": 1
|
||||
},
|
||||
{
|
||||
"name": "Claude: Finding ideal rural housing near University of Utah",
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "personal too where do i live?",
|
||||
"bucket": "high",
|
||||
"delta_preds": -1,
|
||||
"delta_edges": 2,
|
||||
"prod_preds": 14,
|
||||
"cascade_preds": 13
|
||||
},
|
||||
{
|
||||
"name": "Claude: Law enforcement career options",
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "edu and perfessional, where do i work?",
|
||||
"bucket": "high",
|
||||
"delta_preds": 1,
|
||||
"delta_edges": -2,
|
||||
"prod_preds": 18,
|
||||
"cascade_preds": 19
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Rectangle Edge Offset Algorithm",
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": "this is me asking for technical help",
|
||||
"bucket": "low",
|
||||
"delta_preds": 0,
|
||||
"delta_edges": 3,
|
||||
"prod_preds": 2,
|
||||
"cascade_preds": 2
|
||||
}
|
||||
]
|
||||
@@ -0,0 +1,216 @@
|
||||
{
|
||||
"ratings": [
|
||||
{
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"name": "ChatGPT: Provide DXF file"
|
||||
},
|
||||
{
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"name": "Claude: Internship agreement writing help"
|
||||
},
|
||||
{
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "i would say that this is edu and prof",
|
||||
"name": "Claude: Interview presentation research and preparation"
|
||||
},
|
||||
{
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"name": "Circuits II.pptx"
|
||||
},
|
||||
{
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"name": "ChatGPT: Scholarship Recommendation Letter Tips"
|
||||
},
|
||||
{
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"name": "ChatGPT: SEC coaches with OSU ties"
|
||||
},
|
||||
{
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"name": "ChatGPT: Tulsa Concept Album Guide"
|
||||
},
|
||||
{
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "edu and prof",
|
||||
"name": "Claude: I filling out my annual report..."
|
||||
},
|
||||
{
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "edu and personal",
|
||||
"name": "ChatGPT: Remington 700 5R Gen 1"
|
||||
},
|
||||
{
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "edu and prof and personal maybe depending on how you see a job search",
|
||||
"name": "ChatGPT: Research Statement Restructure"
|
||||
},
|
||||
{
|
||||
"binary": "multi",
|
||||
"score": 2,
|
||||
"note": "this is edu also prof but might also be personal becuase its work outside of my normal job",
|
||||
"name": "Claude: SUNY faculty conflict of interest policies"
|
||||
},
|
||||
{
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"name": "Nic Oconnor Field Work F2023 Syllabus.pdf"
|
||||
},
|
||||
{
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "and application letter to an EDU job fits in two domains",
|
||||
"name": "New Mexico Cover Letter.docx"
|
||||
},
|
||||
{
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"name": "ChatGPT: Testing Vector Alignment"
|
||||
},
|
||||
{
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"name": "ChatGPT: Sink Sprayer Fitting Name"
|
||||
},
|
||||
{
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "this could be both edu and professional",
|
||||
"name": "Polar Coordinates.pptx"
|
||||
},
|
||||
{
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "this is the same its edu but part of my profession",
|
||||
"name": "ChatGPT: Rhino 3D object flow"
|
||||
},
|
||||
{
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "technical but apersonal project",
|
||||
"name": "Aaron AI: So, I've been working on the RNAI project, and the way I've ..."
|
||||
},
|
||||
{
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": "this is a technical question",
|
||||
"name": "ChatGPT: Regex for inserting letters"
|
||||
},
|
||||
{
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "this to me is EDU and professional",
|
||||
"name": "Claude: Evaluating tenure prospects at R1 universities"
|
||||
},
|
||||
{
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "this is agina is edu and prfoessional",
|
||||
"name": "ChatGPT: Title: User request summary."
|
||||
},
|
||||
{
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "edu and professional, im applying for a job i forgot i applied for",
|
||||
"name": "University of North Texas Cover letter.pdf"
|
||||
},
|
||||
{
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "personal and professional, im trying to get anotehr job",
|
||||
"name": "ChatGPT: Resume formatting and review"
|
||||
},
|
||||
{
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": null,
|
||||
"name": "Nic Oconnor Ind Study S2024 Syllabus.docx"
|
||||
},
|
||||
{
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": "personal too tho",
|
||||
"name": "Claude: Lubbock on everything album lyrics"
|
||||
},
|
||||
{
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "edu and professional",
|
||||
"name": "Claude: Bonding ASA 3D printed parts"
|
||||
},
|
||||
{
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": "personal to me not edu",
|
||||
"name": "ChatGPT: Respect Individual Interests for Christmas"
|
||||
},
|
||||
{
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "personal too where do i live?",
|
||||
"name": "Claude: Finding ideal rural housing near University of Utah"
|
||||
},
|
||||
{
|
||||
"binary": "multi",
|
||||
"score": 3,
|
||||
"note": "edu and perfessional, where do i work?",
|
||||
"name": "Claude: Law enforcement career options"
|
||||
},
|
||||
{
|
||||
"binary": "single",
|
||||
"score": 5,
|
||||
"note": "this is me asking for technical help",
|
||||
"name": "ChatGPT: Rectangle Edge Offset Algorithm"
|
||||
}
|
||||
],
|
||||
"completed_names": [
|
||||
"ChatGPT: Provide DXF file",
|
||||
"Claude: Internship agreement writing help",
|
||||
"Claude: Interview presentation research and preparation",
|
||||
"Circuits II.pptx",
|
||||
"ChatGPT: Scholarship Recommendation Letter Tips",
|
||||
"ChatGPT: SEC coaches with OSU ties",
|
||||
"ChatGPT: Tulsa Concept Album Guide",
|
||||
"Claude: I filling out my annual report...",
|
||||
"ChatGPT: Remington 700 5R Gen 1",
|
||||
"ChatGPT: Research Statement Restructure",
|
||||
"Claude: SUNY faculty conflict of interest policies",
|
||||
"Nic Oconnor Field Work F2023 Syllabus.pdf",
|
||||
"New Mexico Cover Letter.docx",
|
||||
"ChatGPT: Testing Vector Alignment",
|
||||
"ChatGPT: Sink Sprayer Fitting Name",
|
||||
"Polar Coordinates.pptx",
|
||||
"ChatGPT: Rhino 3D object flow",
|
||||
"Aaron AI: So, I've been working on the RNAI project, and the way I've ...",
|
||||
"ChatGPT: Regex for inserting letters",
|
||||
"Claude: Evaluating tenure prospects at R1 universities",
|
||||
"ChatGPT: Title: User request summary.",
|
||||
"University of North Texas Cover letter.pdf",
|
||||
"ChatGPT: Resume formatting and review",
|
||||
"Nic Oconnor Ind Study S2024 Syllabus.docx",
|
||||
"Claude: Lubbock on everything album lyrics",
|
||||
"Claude: Bonding ASA 3D printed parts",
|
||||
"ChatGPT: Respect Individual Interests for Christmas",
|
||||
"Claude: Finding ideal rural housing near University of Utah",
|
||||
"Claude: Law enforcement career options",
|
||||
"ChatGPT: Rectangle Edge Offset Algorithm"
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,281 @@
|
||||
{
|
||||
"validation_rate": 48.148148148148145,
|
||||
"correct": 13,
|
||||
"total": 27,
|
||||
"thresholds": {
|
||||
"high": 0.7,
|
||||
"low": 0.4
|
||||
},
|
||||
"results": [
|
||||
{
|
||||
"name": "Claude: Lubbock on everything album lyrics",
|
||||
"subsample": "a",
|
||||
"bucket": "high",
|
||||
"frame_relationships": "The Songwriting frame is the primary focus of this content, with the Storytelling and Literary Analysis frames providing context and analysis for understanding Terry Allen's writing style.",
|
||||
"coverage_score": 1.0,
|
||||
"predicted_tier": "baseline",
|
||||
"actual_best": "baseline",
|
||||
"match": true
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Tulsa Concept Album Guide",
|
||||
"subsample": "a",
|
||||
"bucket": "high",
|
||||
"frame_relationships": "The Personal_Narrative and Place_Exploration frames are intertwined, with the album being a personal account of the writer's experiences in Tulsa. The Autobiography frame is used to disguise the concept album as a traditional autobiography.",
|
||||
"coverage_score": 1.0,
|
||||
"predicted_tier": "baseline",
|
||||
"actual_best": "taxfree",
|
||||
"match": false
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Rhino 3D object flow",
|
||||
"subsample": "a",
|
||||
"bucket": "high",
|
||||
"frame_relationships": "The philosophical concepts provide a framework for understanding and describing digital fabrication processes.",
|
||||
"coverage_score": 1.0,
|
||||
"predicted_tier": "baseline",
|
||||
"actual_best": "baseline",
|
||||
"match": true
|
||||
},
|
||||
{
|
||||
"name": "Claude: SUNY faculty conflict of interest policies",
|
||||
"subsample": "a",
|
||||
"bucket": "mid",
|
||||
"frame_relationships": "The document primarily discusses the SUNY conflict of interest policy for faculty members, focusing on outside work and consulting. The system-level policy and specific campus handbooks are provided as resources.",
|
||||
"coverage_score": 1.0,
|
||||
"predicted_tier": "baseline",
|
||||
"actual_best": "baseline",
|
||||
"match": true
|
||||
},
|
||||
{
|
||||
"name": "Claude: Interview presentation research and preparation",
|
||||
"subsample": "a",
|
||||
"bucket": "mid",
|
||||
"frame_relationships": "The Digital-Physical Interface frame centers the workflow between digital models and physical objects, while Emerging Technologies highlights the specific tools (AI, XR) used in this workflow.",
|
||||
"coverage_score": 1.0,
|
||||
"predicted_tier": "baseline",
|
||||
"actual_best": "baseline",
|
||||
"match": true
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: Respect Individual Interests for Christmas",
|
||||
"subsample": "a",
|
||||
"bucket": "low",
|
||||
"frame_relationships": "The frames of gift giving and interests/preferences are intertwined as the document discusses choosing gifts based on a recipient's unique hobbies and preferences during the holiday season.",
|
||||
"coverage_score": 1.0,
|
||||
"predicted_tier": "baseline",
|
||||
"actual_best": "baseline",
|
||||
"match": true
|
||||
},
|
||||
{
|
||||
"name": "University of North Texas Cover letter.pdf",
|
||||
"subsample": "a",
|
||||
"bucket": "document",
|
||||
"frame_relationships": "The 'Education' and 'Job Application' frames are the primary drivers of the document, with 'Artwork Creation' providing contextual details. The 'Academia' frame serves as a broader context for all other frames.",
|
||||
"coverage_score": 1.0,
|
||||
"predicted_tier": "baseline",
|
||||
"actual_best": "taxfree",
|
||||
"match": false
|
||||
},
|
||||
{
|
||||
"name": "Claude: Finding ideal rural housing near University of Utah",
|
||||
"subsample": "a",
|
||||
"bucket": "high",
|
||||
"frame_relationships": "The geographical location and community characteristics inform the housing and transportation options in each area.",
|
||||
"coverage_score": 1.0,
|
||||
"predicted_tier": "baseline",
|
||||
"actual_best": "taxfree",
|
||||
"match": false
|
||||
},
|
||||
{
|
||||
"name": "ChatGPT: SEC coaches with OSU ties",
|
||||
"subsample": "a",
|
||||
"bucket": "high",
|
||||
"frame_relationships": "",
|
||||
"coverage_score": 0.0,
|
||||
"predicted_tier": "taxfree",
|
||||
"actual_best": "baseline",
|
||||
"match": false
|
||||
},
{
"name": "Claude: Bonding ASA 3D printed parts",
"subsample": "a",
"bucket": "mid",
"frame_relationships": "The frames 'Bonding' and 'Material_Selection' are active, with the latter providing context for the former by discussing suitable adhesives for bonding 3D printed parts.",
"coverage_score": 1.0,
"predicted_tier": "baseline",
"actual_best": "taxfree",
"match": false
},
{
"name": "ChatGPT: Title: User request summary.",
"subsample": "a",
"bucket": "low",
"frame_relationships": "The 'Event' frame is the main context, within which the 'Speech', 'Hosting', and 'Communication' frames are active.",
"coverage_score": 1.0,
"predicted_tier": "baseline",
"actual_best": "taxfree",
"match": false
},
{
"name": "ChatGPT: Scholarship Recommendation Letter Tips",
"subsample": "a",
"bucket": "low",
"frame_relationships": "The frames interact to provide a comprehensive assessment of the candidate's qualifications for a scholarship, focusing on both their academic and personal qualities.",
"coverage_score": 1.0,
"predicted_tier": "baseline",
"actual_best": "baseline",
"match": true
},
{
"name": "ChatGPT: Job application comparison",
"subsample": "b",
"bucket": "high",
"frame_relationships": "The frames interact by comparing the applicant's qualifications to the job requirements, peer deans' profiles, and the mission of the School of the Arts at Western.",
"coverage_score": 1.0,
"predicted_tier": "baseline",
"actual_best": "standard",
"match": false
},
{
"name": "ChatGPT: External review for tenure",
"subsample": "b",
"bucket": "high",
"frame_relationships": "The Tenure Review frame is the central and overarching context in which the other frames (Academic Evaluation, Peer Review) are activated.",
"coverage_score": 1.0,
"predicted_tier": "baseline",
"actual_best": "standard",
"match": false
},
{
"name": "Claude: University of Utah interview teaching example",
"subsample": "b",
"bucket": "high",
"frame_relationships": "The document discusses a university course, its curriculum, and related projects; these topics are interconnected as they all revolve around the teaching of computational design in a university setting.",
"coverage_score": 1.0,
"predicted_tier": "baseline",
"actual_best": "taxfree",
"match": false
},
{
"name": "ChatGPT: Starting Dropship Gun Business",
"subsample": "b",
"bucket": "high",
"frame_relationships": "The 'GunBusinessSetup' frame involves navigating multiple layers of compliance (FFL, zoning, and state/local laws) while building relationships with distributors and setting up an e-commerce platform.",
"coverage_score": 1.0,
"predicted_tier": "baseline",
"actual_best": "taxfree",
"match": false
},
{
"name": "ChatGPT: Analyze business plan",
"subsample": "b",
"bucket": "high",
"frame_relationships": "",
"coverage_score": 0.0,
"predicted_tier": "taxfree",
"actual_best": "taxfree",
"match": true
},
{
"name": "ChatGPT: Outdoor Layering Explained",
"subsample": "b",
"bucket": "mid",
"frame_relationships": "The frames interact by defining layers in an outdoor clothing system for managing sweat, insulation, and protection from weather conditions.",
"coverage_score": 1.0,
"predicted_tier": "baseline",
"actual_best": "taxfree",
"match": false
},
{
"name": "ChatGPT: Limits in Calculus.",
"subsample": "b",
"bucket": "mid",
"frame_relationships": "The 'Mathematics' and 'Calculus' frames are active. The 'Functions', 'Limits', and 'Calculus' frames are interconnected, as the document discusses the concept of limits in relation to functions within calculus.",
"coverage_score": 1.0,
"predicted_tier": "baseline",
"actual_best": "baseline",
"match": true
},
{
"name": "ChatGPT: Academic Program Director Role",
"subsample": "b",
"bucket": "mid",
"frame_relationships": "The 'academic program' frame underpins the entire document, with 'administration', 'faculty management', and 'student engagement' frames supporting it. The 'curriculum development' frame is a core function that intersects all other frames.",
"coverage_score": 1.0,
"predicted_tier": "baseline",
"actual_best": "taxfree",
"match": false
},
{
"name": "ChatGPT: Lonely Island Poop Skit",
"subsample": "b",
"bucket": "mid",
"frame_relationships": "The Music frame is used to deliver comedic content within the Humor frame.",
"coverage_score": 1.0,
"predicted_tier": "baseline",
"actual_best": "standard",
"match": false
},
{
"name": "ChatGPT: Parse Tidal playlist",
"subsample": "b",
"bucket": "mid",
"frame_relationships": "",
"coverage_score": 0.0,
"predicted_tier": "taxfree",
"actual_best": "taxfree",
"match": true
},
{
"name": "NO thesis proposal.pdf",
"subsample": "b",
"bucket": "document",
"frame_relationships": "The artist's career progression is intertwined with their identity and personal interests, shaping the subject matter and style of their product designs.",
"coverage_score": 1.0,
"predicted_tier": "baseline",
"actual_best": "baseline",
"match": true
},
{
"name": "PWM.pdf",
"subsample": "b",
"bucket": "document",
"frame_relationships": "The PWM and Analog to Digital Conversion frames are directly related as they both discuss the principles and applications of PWM in relation to analog-to-digital conversion, while the Arduino ADC frame provides specific context regarding the use of these concepts with an Arduino microcontroller.",
"coverage_score": 1.0,
"predicted_tier": "baseline",
"actual_best": "baseline",
"match": true
},
{
"name": "Will_It_Print.pdf",
"subsample": "b",
"bucket": "document",
"frame_relationships": "The document discusses the design and manufacturing of a part, focusing on its features, material removal, surface finish, support structures, overhangs, internal cavities, tolerances, and stress concentration.",
"coverage_score": 1.0,
"predicted_tier": "baseline",
"actual_best": "taxfree",
"match": false
},
{
"name": "Kim Kedem Ind Study F2025 Syllabus.docx",
"subsample": "b",
"bucket": "document",
"frame_relationships": "The Education frame serves as the context for the Project and Software Development frames, with the 3D Printing frame acting as a specific application of these skills.",
"coverage_score": 1.0,
"predicted_tier": "baseline",
"actual_best": "baseline",
"match": true
},
{
"name": "Aaron Nelson Graduate Transcript.pdf",
"subsample": "b",
"bucket": "document",
"frame_relationships": "",
"coverage_score": 0.0,
"predicted_tier": "taxfree",
"actual_best": "taxfree",
"match": true
}
]
}
@@ -0,0 +1,104 @@
{
"episodes": [
{
"name": "Claude: Modern website redesign with portfolio and contact form",
"content": "[Claude conversation: Modern website redesign with portfolio and contact form]\n\nYou: Q: Want to take some time to browse, or keep moving?\nA: I'll browse those and come back \u2014 let's pause here\n\nClaude: Sounds good, Aaron. Take your time clicking through those \u2014 especially the Architecture & Design community page and the templates. Pay attention to what makes you stop scrolling. When you come back, we can map out your full site architecture, figure out the page structure for both the academic and consulting audiences, and spec out that contact form. Have fun browsing.\n\nYou: oh god so ive notice that a lot of artist websites look like this and I just assumed it was an aesthetic choice but now Im seeing that all the cargo stites look similar [Claude conversation: Modern website redesign with portfolio and contact form]\n\nYou: oh god so ive notice that a lot of artist websites look like this and I just assumed it was an aesthetic choice but now Im seeing that all the cargo stites look similar\n\nClaude: Ha \u2014 yeah, you just pulled back the curtain. That \"clean grid, minimal typography, lots of white space, project thumbnails in a masonry layout\" look that's everywhere in the creative professional world isn't really an aesthetic movement. It's Cargo's design system. Once you see it, you can't unsee it.\n\nThis is actually a real consideration for you, and I want to be honest about it rather than hand-wave it away. On one hand, there's a reason that look is everywhere \u2014 it works, it's legible, it puts the work front and center. On the other hand, you're a designer. If your website looks like every MFA grad's portfolio site, that says something you probably don't want it to say. Especially when you're trying to position yourself as someone who builds things, who leads a fabrication lab, who works at the intersection of computation and manufacturing. 
That identity is distinct from the gallery-artist-with-a-Cargo-site world.\n\nYou *can* customize Cargo significantly beyond the defau",
"category": "recent_conversation"
},
{
"name": "Claude: University of Utah interview teaching example",
"content": "[Claude conversation: University of Utah interview teaching example]\n\nYou: I guess the next class would be 2 d computational design. It's a programming for the arts course. We teach primarily in processing in Java.\n\nClaude: Got it \u2014 2D Computational Design, programming for the arts, primarily in Processing/Java. What's next?\n\nYou: We teach an advanced 3 d printing course This goes through advanced slicer stuff. We go through usage of the Stratasys equipment, both, FDM and PolyJet. The students also have the option of taking a, Stratasys certification the additive manufacturing certification at the end of the course. [Claude conversation: University of Utah interview teaching example]\n\nYou: Can we have one with just text only and I can let the PP AI designer have a try\n\nClaude: Here's V4 \u2014 completely plain. No colors, no shapes, no accent bars, no styled cards. Just text on white slides with all the speaker notes intact. PowerPoint Designer should have a field day with this. You've got both versions now \u2014 V3 with my styling if you prefer it, V4 plain for Designer to take a crack at. Upload the finished product whenever you're ready and I'll parse it.\n\nYou: Shit I forgot about this, does this count to something?:\n\nhttps://www.aaronmakingart.com/15-twitter-printer/ [Claude conversation: University of Utah interview teaching example]\n\nYou: Shit I forgot about this, does this count to something?:\n\nhttps://www.aaronmakingart.com/15-twitter-printer/\n\nClaude: Yeah, this counts. It counts in a few ways:\n\n**For the presentation prompt specifically** \u2014 this is a project that combines physical computing (Raspberry Pi, thermal printer, 3D printed housing), custom Python code interfacing with an API, and critical engagement with technology's social impact. It's a designed object that makes an invisible digital phenomenon physically present. 
That's exactly the kind of work that bridges the technical and the critical \u2014 which is where MDD lives.\n\n**For the room you're presenting to*",
"category": "recent_conversation"
},
{
"name": "Claude: Preparing for dinner with Jim Agutter",
"content": "[Claude conversation: Preparing for dinner with Jim Agutter]\n\nYou: I\u2019m meeting with Jim for dinner tonight what can you tell me about him\n\nClaude: Jim Agutter is the founder of the Division of Multi-Disciplinary Design (MDD) at the University of Utah and chairs the search committee for the position you're interviewing for. Here's what's worth knowing heading into dinner:\n\n**Background:** He built MDD from the ground up, so he's likely to have strong opinions about the program's identity and direction. As both founder and search committee chair, his read on you will carry significant weight.\n\n**Research angle:** His work sits at the intersection of information visualization and design \u2014 he's done notable research in healthcare data visualization and complex systems display. There's natural overlap with your computational design work, particularly around data-driven design processes.\n\n**Conversation hooks:** Given that he founded a multi-disciplinary program, he'll probably respond well to hearing how you built HVAMC from scratch \u2014 the parallels are obvious (equipment, policy, client engagement, program growth). That shared experience of building something from nothing is a strong connector.\n\n**What to watch for:** As the founder, he may be protective of MDD's culture and trajectory. Listen carefully for what he signals about the program's future direction and where he sees gaps. 
His questions will likely reveal what he thinks the program needs most.\n\nWant me to search for any recent publications or projects of his, or anything else that might be useful before tonight?\n\nYou: Yes do a thourough search [Claude conversation: Preparing for dinner with Jim Agutter]\n\nYou: Yes do a thourough search\n\nClaude: Let me dig into Jim Agutter's recent work and background thoroughly.Here's the full picture on Jim Agutter going into tonight:\n\n**Current roles (he wears a lot of hats):**\nHis current title is Senior Associate Dean of Faculty Success & Academic Innovation in the Office of",
"category": "recent_conversation"
},
{
"name": "Claude: Realtor requirements for home buying in New York",
"content": "[Claude conversation: Realtor requirements for home buying in New York]\n\nYou: do I need to have a realtor in ny state to buy a home?\n\nClaude: No, Aaron \u2014 it's perfectly legal in all 50 states to buy a house without using a real estate agent on your end, and New York is no exception. With the necessary due diligence, you can buy a house without a Realtor in New York.\n\nThat said, there's one important distinction for New York specifically: in New York, you must use a real estate attorney to assist with the paperwork and legal aspects of the transaction. This is different from many other states \u2014 NY is one of the states that requires attorney involvement in real estate closings. So while a realtor is optional, a real estate attorney is not.\n\nA few things worth keeping in mind as you're thinking about this up in the Hudson Valley area:\n\n**What a buyer's agent does for you** \u2014 they handle property searches, negotiate on your behalf, and guide you through co-op/condo board processes if applicable. Most home sellers have an agent working on their side of the deal, so it's smart to have someone on your side as a buyer too.\n\n**The cost question** \u2014 buying without an agent is rarely cheaper, as the commission your realtor would collect will likely just be collected by the seller's agent instead. That said, since the 2024 NAR settlement changes, commission structures are more negotiable and buyers may need to negotiate their agent's fee directly.\n\n**If you skip the agent** \u2014 you'll want to lean more heavily on your real estate attorney and be prepared to do your own comparable market analysis, negotiate directly with the seller's agent, and manage the inspection/appraisal timeline yourself.\n\nGiven that you're already in the area and presumably know the local market reasonably well, going without could be viable \u2014 but a good attorney is non-negotiable either way. 
Want me to dig into anything more specific, like first-time buyer programs in NY or what to look for in a real estat",
"category": "recent_conversation"
},
{
"name": "Claude: Memory",
"content": "[Claude memory \u2014 what Claude has learned about Aaron]\n\n- {'conversations_memory': '**Work context**\\n\\nAaron Nelson is an Associate Professor of Digital Design & Fabrication and Program Director and Lead Researcher for the Hudson Valley Additive Manufacturing Center (HVAMC) at SUNY New Paltz, a facility he built from the ground up since 2016. He oversees the DDF minor (~50 students), a hybrid graduate program stacked with undergraduates, and industry partnerships with companies including IBM, Braskem, Selux, Sono-Tek, JBT, and Usheco. Aaron is actively weighing whether to remain at SUNY New Paltz \u2014 which faces structural budget deficits and program vulnerability risks \u2014 or pursue a major career transition, with a University of Utah faculty opportunity and an independent consultancy as the leading alternatives.\\n\\n**Personal context**\\n\\nAaron grew up in Tulsa from 1989 to 2001 and has a strong emotional pull back to the city and the broader Oklahoma/Rocky Mountain region. He has a merle border collie named Bird. He is working on a concept album about Tulsa \u2014 modeled on Terry Allen\\'s *Lubbock (on Everything)* \u2014 framed as an autobiography disguised as a breakup album with the city, drawing on his experience as a mixed-race person who felt rejected by Tulsa. 
He also maintains a range of design and illustration side projects (graffiti lettering, vector art, historical restoration work) and has interest in advanced geometric calculations relevant to fabrication contexts.\\n\\n**Top of mind**\\n\\nAaron has an on-campus interview at the University of Utah\\'s Division of Multi-Disciplinary Design (MDD) scheduled for the third week of April 2026, and interview prep is his primary active focus \u2014 including a concrete teaching example built around his Advanced 3D Printing chocolate casting project (PolyJet printing \u2192 silicone mold-making \u2192 food-safe casting \u2192 packaging and branding), which best demonstrates innovation with mixed-skill, cross-disciplinary students. He is simultane",
"category": "recent_conversation"
},
{
"name": "ChatGPT: Website architecture options",
"content": "[Conversation: Website architecture options] Aaron: i want to make a modern webiste for a product with images, video, and e comerce integration. It should have aback end to be easily updateable by non technical staff once it is up in operation. What are the architectures or frameworks I should be looking at. ChatGPT: For what you\u2019re describing, you\u2019re basically choosing between a few big **architectural patterns**, and then some specific **framework + CMS + e-commerce** combos that fit your budget and tolerance for complexity. I\u2019ll lay this out like a menu so you can mix and match. --- ## 1. Start with the big picture: what \u201cshape\u201d of site? You want: - Modern front end (images, video, nice UX) - E-commerce - Non-technical staff can update content/products - Maintainable over time That leads to three common approaches: ### A. All-in-one hosted platform (least dev overhead) **Examples:** - **Shopify** (with a modern theme or custom front end) - **Squarespace** Commerce - **Wix** with e-commerce **Why pick this:** - Fastest to launch. - Non-technical staff can edit almost everything through the UI. - Payments, security, hosting managed for you. - Plenty of modern templates with video, sections, etc. **Trade-offs:** - Less control over custom interactions or weird layouts. - Developer work is mainly theme customization / apps, not full freedom. - Can get \u201cplugin soup\u201d if you bolt on too many apps. This is ideal if you want to focus on product and branding, not dev ops. --- ### B. Headless \u201cJamstack\u201d style (most flexible, very modern) **Concept:** - Front end built with a modern JS framework. - Back end is mostly **APIs**: headless CMS + headless e-commerce + media hosting. 
**Typical stack:** - **Front end framework**: - **Next.js** (React-based, great defaults, SEO friendly, hybrid static/server rendering) - Or **Nuxt** (Vue-based), **SvelteKit** (Svelte), **Remix** (React) - **CMS for non-technical editors**: - **Sanity**, **Contentful**, **Strapi**, **Prismic**, **Sto",
"category": "mid_conversation"
},
{
"name": "ChatGPT: Meat and longevity tips",
"content": "[Conversation: Meat and longevity tips] Aaron: What kind of meat do you eat for longevity ChatGPT: That\u2019s a great question \u2014 the short answer is: there\u2019s no one \u201cmagic meat\u201d that guarantees longevity. But research does suggest that *which* meats you eat, *how much*, *how often*, and *how* they\u2019re prepared all matter a lot for long-term health. Below is a summary of what the science currently suggests, plus practical tips. --- ## What the evidence says Here\u2019s a synthesis of what observational and epidemiological studies show (with the caveat that diet studies are inherently tricky to interpret). ### Meat types & longevity risks - **Red meat (beef, pork, lamb, etc.)** Many large long-term studies have found higher intake of red meat (especially *unprocessed* red meat) to be associated with modestly increased risk of total mortality, cardiovascular disease, and certain cancers. \ue200cite\ue202turn0search3\ue202turn0search9\ue202turn0search2\ue201 Processed red meats (bacon, sausage, deli meats, etc.) have even stronger associations with health risks. \ue200cite\ue202turn0search9\ue202turn0search10\ue202turn0search18\ue201 - **White meat / poultry / lean cuts** Some studies show neutral or even slightly beneficial associations for \u201cwhite\u201d meats (e.g. chicken, turkey) when compared to red meat. \ue200cite\ue202turn0search3\ue201 - **Fish / seafood** Fish is often considered one of the more favorable animal proteins for longevity, especially when it replaces red and processed meats. In dietary patterns tied to longevity, fish often plays a key role. \ue200cite\ue202turn0search17\ue202turn0search11\ue202turn0search30\ue201 - **Very low meat / plant-based diets** Some cohorts suggest that diets with *very low* meat intake (or entirely plant-based) are associated with lower mortality risk over decades. 
\ue200cite\ue202turn0search5\ue202turn0search7\ue202turn0search11\ue201 However, other research cautions that the data are mostly observational and subject to confounding factors (e.g. overall lifestyle, socioeconomic status, access to health care) so we can\u2019t conclude strong causality. \ue200",
"category": "mid_conversation"
},
{
"name": "ChatGPT: Funk and jazz list",
"content": "[Conversation: Funk and jazz list] Aaron: 100 funk and jazz you can include newer artists like meehand street band and vulfpeck i need a list to import into tunemy music ChatGPT: Awesome \u2014 you\u2019re looking for **100 tracks** blending **funk and jazz**, including **newer artists like Meenhan Street Band and Vulfpeck**. That\u2019s a killer combo! I\u2019ll break this into four batches of **25 songs** each so you can import them smoothly into TuneMyMusic. Here\u2019s the first batch of **25 funk and jazz-infused tracks**, mixing **classic and modern artists** to keep things fresh: --- ### \ud83c\udfb7 Funk & Jazz \u2013 Batch 1 of 4 1. Vulfpeck \u2013 **Dean Town** 2. The Mehan Street Band \u2013 **Midnight Morning** 3. Cory Wong \u2013 **Cosmic Sans** 4. Lettuce \u2013 **Phyllis** 5. Ghost-Note \u2013 **Swagism** 6. Snarky Puppy \u2013 **What About Me?** 7. Tower of Power \u2013 **Soul With a Capital \u2018S\u2019** 8. Average White Band \u2013 **Pick Up the Pieces** 9. Herbie Hancock \u2013 **Chameleon** 10. Maceo Parker \u2013 **Shake Everything You\u2019ve Got** 11. The New Mastersounds \u2013 **Your Love is Mine** 12. Stuff \u2013 **Foots** 13. Kamasi Washington \u2013 **Street Fighter Mas** 14. The Greyboy Allstars \u2013 **Still Waiting** 15. Prince \u2013 **Musicology** 16. Cory Henry \u2013 **Trade It All** 17. Parliament \u2013 **Give Up the Funk (Tear the Roof Off the Sucker)** 18. The JB\u2019s \u2013 **Pass the Peas** 19. Scary Pockets \u2013 **Ain\u2019t No Sunshine (feat. Kenton Chen)** 20. The Comet Is Coming \u2013 **Summon the Fire** 21. Roy Hargrove \u2013 **Strasbourg / St. Denis** 22. Soulive \u2013 **El Ron** 23. The Budos Band \u2013 **Up From the South** 24. Khruangbin \u2013 **Maria Tambi\u00e9n** 25. Tom Misch & Yussef Dayes \u2013 **What Kinda Music** --- Would you like the **next 25 tracks**? I\u2019ll continue with a similar energy and variety of classic & contemporary funk-jazz gems. 
Aaron: I want all 100 in a list that I can import into tune my music ChatGPT: Got it \u2014 here's a **single list of 100 funk and jazz tracks**, curated with a blend of **classic legends** and **new-wave artists** like Vulfpeck, Meenhan Street Band, Co",
"category": "mid_conversation"
},
{
"name": "ChatGPT: Variants for different SKUs",
"content": "[Conversation: Variants for different SKUs] Aaron: i am running into an issue that i don;t know how to solve, where it seems like i need to have two products 1 each for the 6 inch and 8 inch segments so they get different skus ChatGPT: Yeah, you\u2019ve basically hit the point where \u201ctwo SKUs, one visual product\u201d becomes a thing \ud83d\ude42 You **don\u2019t have to** create two totally separate products in Sanity/Next for the 6\" and 8\" segments. Instead, you can treat them as **variants of a single product**, each with its own SKU (and price, stock, etc.), while keeping them on **one PDP**. Here\u2019s how I\u2019d handle it in your current setup. --- ## Conceptual answer - Each **sellable thing with a different SKU** should be its own **variant**. - Variants live under one **parent product** (your \u201cSegment\u201d product). - On the site, the user chooses **6\" vs 8\"** from a dropdown, and you use the variant\u2019s SKU when: - sending to your cart / checkout - syncing with inventory - showing price, etc. So you have **one product document** in Sanity like \u201cMossy Segment\u201d and inside it: - Variant A: 6\" / SKU SEG-6IN - Variant B: 8\" / SKU SEG-8IN --- ## 1. Sanity schema: add variants In your `product` schema, you can add a `variants` array and a `productVariant` object: ```ts // schemas/productVariant.ts export default { name: 'productVariant', title: 'Variant', type: 'object', fields: [ { name: 'label', title: 'Label (e.g. 
6 inch)', type: 'string', validation: (Rule) => Rule.required(), }, { name: 'lengthInches', title: 'Length (inches)', type: 'number', }, { name: 'sku', title: 'SKU', type: 'string', validation: (Rule) => Rule.required(), }, { name: 'price', title: 'Price', type: 'number', }, { name: 'etsyVariantId', title: 'Etsy Variant ID (optional)', type: 'string', }, ], }; ``` Then in your `product` schema: ```ts // schemas/product.ts export default { name: 'product', title: 'Product', type: 'document', fields: [ // ...existing fields: title, slug, image, kind, etc. { name: 'variants', title: 'Variant",
"category": "mid_conversation"
},
{
"name": "ChatGPT: Hip hop weed songs",
"content": "[Conversation: Hip hop weed songs] Aaron: hip hop songs about smoking weed ChatGPT: Nice! You're looking for hip-hop songs about smoking weed \u2014 a classic vibe with tons of iconic tracks. Here\u2019s a **25-song playlist** that blends classic stoner anthems, laid-back beats, and lyrical homages to cannabis culture across different eras of hip-hop: --- ### \ud83c\udf3f **Hip-Hop Weed Anthems \u2013 25 Tracks** \ud83c\udf3f 1. **Dr. Dre \u2013 \"The Next Episode\" (feat. Snoop Dogg)** 2. **Cypress Hill \u2013 \"Hits from the Bong\"** 3. **Wiz Khalifa \u2013 \"Still Blazin\"** 4. **Redman \u2013 \"How to Roll a Blunt\"** 5. **Curren$y \u2013 \"Mary\"** 6. **Devin the Dude \u2013 \"Doobie Ashtray\"** 7. **Bone Thugs-N-Harmony \u2013 \"Buddah Lovaz\"** 8. **Afroman \u2013 \"Because I Got High\"** 9. **Snoop Dogg \u2013 \"Smoke Weed Everyday\" (Outro of \"The Next Episode\")** 10. **Berner \u2013 \"20 Joints\"** 11. **Kid Cudi \u2013 \"Marijuana\"** 12. **Luniz \u2013 \"I Got 5 On It\"** 13. **Method Man & Redman \u2013 \"How High\"** 14. **Ty Dolla $ign \u2013 \"Love U Better\" (feat. Lil Wayne & The-Dream) [weed vibes]** 15. **ScHoolboy Q \u2013 \"Druggys Wit Hoes Again\" (feat. Ab-Soul)** 16. **Juicy J \u2013 \"Smokin' Rollin'\" (feat. Pimp C)** 17. **Flatbush Zombies \u2013 \"Palm Trees\"** 18. **Mac Miller \u2013 \"The Star Room\"** 19. **Freddie Gibbs & Madlib \u2013 \"High\" (feat. Danny Brown)** 20. **Rick Ross \u2013 \"Super High\" (feat. Ne-Yo)** 21. **Outkast \u2013 \"Crumblin' Erb\"** 22. **Joey Bada$$ \u2013 \"Paper Trail$\" [vibe fit, not directly weed-themed]** 23. **Action Bronson \u2013 \"The Rockers\" (feat. Wiz Khalifa)** 24. **Smoke DZA \u2013 \"Personal Party\" (feat. Curren$y)** 25. **Do or Die \u2013 \"Po Pimp\" (feat. Twista & Johnny P)** --- ### Want more? 
If you\u2019re vibing with this list, I can roll out another batch of 25 tracks \u2014 whether you want more chill beats, trap-heavy weed anthems, or a deeper dive into underground stoner rap. Just say the word \ud83c\udf2c\ufe0f\ud83d\udca8 Would you like a Part 2? Aaron: part 2 ChatGPT: Love it \u2014 let\u2019s keep the rotation going! \ud83c\udf3f\ud83d\udd25 Here's **Part 2** of your hip-hop weed anthem playlist \u2014 another **25 tracks** for those laid-back, cloudy ",
"category": "mid_conversation"
},
{
"name": "ChatGPT: Anvil Jeep with Wheels",
"content": "[Conversation: Anvil Jeep with Wheels] Aaron: a anvil coloered jeep jl with white wagon style wheels ChatGPT: Picture a Jeep JL with an anvil gray paint job, giving it a sleek yet rugged look. The matte finish exudes strength and durability, perfectly matching the vehicle's off-road capabilities. The Jeep is outfitted with distinctive white wagon-style wheels that pop against the darker body, offering a nostalgic yet modern contrast. The classic design of the wheels gives a vintage touch, while the Jeep's boxy, muscular stance remains thoroughly contemporary. The vehicle stands ready for adventure, blending both style and function in a unique way. Would you like me to generate an image based on this description? Aaron: yes ChatGPT: { \"size\": \"1024x1024\", \"prompt\": \"A Jeep JL with an anvil gray paint job, featuring a matte finish for a rugged look. The Jeep has distinctive white wagon-style wheels that offer a nostalgic contrast against the darker body. The vehicle has a boxy and muscular stance, evoking a sense of strength and adventure, blending modern design with a touch of vintage style.\" } ChatGPT: Here are the images of the Jeep JL with an anvil gray finish and white wagon-style wheels, blending a modern rugged look with a vintage touch. Let me know if you'd like any adjustments! Aaron: the actual sheels shoudl be white ChatGPT: { \"size\": \"1024x1024\", \"prompt\": \"A Jeep JL with an anvil gray paint job, featuring a matte finish for a rugged look. The Jeep has pure white wagon-style wheels that stand out sharply against the darker body, giving a strong contrast. The vehicle maintains its boxy and muscular stance, exuding both modern design and vintage styling with these distinctive white wheels. The scene emphasizes the combination of strength, adventure, and nostalgic style.\" } ChatGPT: Here are the updated images of the Jeep JL, now featuring the pure white wagon-style wheels that stand out against the anvil gray body. 
Let me know if there's anything else you'd ",
"category": "early_conversation"
},
{
"name": "ChatGPT: Octopus Beer Line Art",
"content": "[Conversation: Octopus Beer Line Art] Aaron: black and white vector line art of an octopus drinking a beer ChatGPT: { \"prompt\": \"A detailed black and white vector line art of an octopus drinking a beer. The octopus has intricate, swirling tentacles, some of which are holding a pint glass filled with beer. The scene is playful, with bubbles rising from the beer and a relaxed expression on the octopus' face. The octopus should have textured skin with fine linework to show detail, and the beer glass should have a frothy head. The image should have a clean, minimalistic design, with precise lines and a modern feel, suitable for a tattoo or illustration.\", \"size\": \"1024x1024\" } ChatGPT: Here are two different black and white vector line art designs of an octopus drinking a beer. If you'd like any adjustments or have more specific preferences, feel free to let me know! Aaron: Too busy ChatGPT: { \"prompt\": \"A simplified black and white vector line art of an octopus drinking a beer. The design should focus on minimalistic, clean lines with fewer details, making the composition less busy. The octopus can have smooth, flowing tentacles, and only one tentacle should hold a pint glass of beer. The beer glass should have a subtle frothy head, and the octopus should have a relaxed expression. Keep the linework thin and uncluttered, with a modern and simple style, suitable for a minimalist illustration or logo.\", \"size\": \"1024x1024\" } ChatGPT: Here are two simplified, minimalist designs of an octopus drinking a beer. Let me know if you'd like further adjustments! Aaron: too simple ChatGPT: { \"prompt\": \"A balanced black and white vector line art of an octopus drinking a beer. The design should have more detail than minimalistic styles but still maintain a clean, organized look. The octopus should have a few gracefully swirling tentacles, with one tentacle holding a pint glass with a frothy head. 
The octopus should have some light texturing on its skin, with thin, consistent lines, ",
"category": "early_conversation"
},
{
"name": "ChatGPT: Monty Python's Moving On",
"content": "[Conversation: Monty Python's Moving On] Aaron: did monty python ever do sketches about moving on ChatGPT: Yes, Monty Python, the British comedy group, created sketches that explored the theme of moving on or moving forward in various contexts. Here are a few examples: 1. The Dead Parrot Sketch: In this famous sketch, a customer played by John Cleese tries to return a dead parrot to a pet shop. Despite the obvious fact that the parrot is deceased, the shopkeeper played by Michael Palin insists that it's simply resting or pining for the fjords. The sketch humorously highlights the frustrations of trying to move on from a faulty purchase or resolve a problem with stubbornness. 2. The Ministry of Silly Walks: This sketch features John Cleese as a civil servant who works in the Ministry of Silly Walks. The premise revolves around absurdly exaggerated and impractical walking styles. Although not explicitly about moving on, it satirizes bureaucracy and the rigid adherence to outdated traditions, hinting at the need to move forward and embrace change. 3. The Lumberjack Song: In this musical sketch, Michael Palin plays a rugged lumberjack who confesses that he wants to be a woman and wear women's clothing. The song humorously depicts his desire for change and moving on from his current profession to pursue a more unconventional path. 4. The Argument Clinic: This sketch portrays a customer, played by Michael Palin, seeking an argument at an argument clinic. John Cleese, who portrays the employee, engages in a series of illogical and infuriating exchanges, making it difficult for the customer to move forward or make progress in their argument. The sketch pokes fun at the futile nature of certain discussions and the challenges of resolving conflicts. 
These sketches and others created by Monty Python often incorporated themes of absurdity, social commentary, and the struggles of navigating everyday situations, including the desire to move on or deal with obstacles in comedic wa",
"category": "early_conversation"
},
{
"name": "ChatGPT: Generate Female Brunette Image",
"content": "[Conversation: Generate Female Brunette Image] Aaron: I am going to upload an image, please generate an image of a mate, preference for female brunette ChatGPT: {\"prompt\":\"Generate a photorealistic portrait of a white brunette woman in her 30s, matching the level of attractiveness and style of the provided image. She should have a neutral to slight smile expression and a simple indoor background similar to the original setting.\",\"size\":\"512x512\",\"n\":1} ChatGPT: Here\u2019s an AI-generated portrait of a female brunette, roughly the same age and attractiveness level, set in a similar design-focused environment. I matched the lighting and background aesthetic to complement the original image you provided. Aaron: I am going to upload an image, please generate an image of a mate, preference for male brunette ChatGPT: {\"prompt\":\"A close-up portrait of a Caucasian man in his late 20s to early 30s with dark brown hair, neatly styled. He wears a casual button-up shirt and smiles gently at the camera. The setting is a design studio with a 3D printer and architectural models in the background. Soft, diffused lighting highlights his features, and the focus is sharp on his face, matching the style and warmth of the provided reference image.\"} ChatGPT: Here\u2019s an AI-generated portrait featuring a male brunette of a similar age and style, seated in a design studio environment to match the aesthetic of your original image. I\u2019ve aimed to reflect comparable lighting, attire, and background elements so that the two images complement each other. 
Aaron: make him bald and asian ChatGPT: {\"prompt\":\"Transform the referenced portrait by making the man bald with Asian facial features, while retaining the same professional setting, clothing, and lighting.\",\"size\":\"1024x1024\",\"n\":1,\"referenced_image_ids\":[\"15b170bc-0716-49a9-b3dc-619c3f69394f\"]} ChatGPT: Here\u2019s an edited version of the portrait: a bald, Asian male in the same design studio setting. I framed him with comparable lighting, attire, and ",
"category": "early_conversation"
},
{
"name": "ChatGPT: New chat",
"content": "[Conversation: New chat] Aaron: python script for grasshoper that explode text ChatGPT: Certainly! Here's an example of a Python script for Grasshopper that explodes text: ```python import rhinoscriptsyntax as rs def explode_text(text): curves = [] objects = rs.ExplodeText(text) if objects: for obj in objects: if rs.IsCurve(obj): curves.append(obj) else: rs.DeleteObject(obj) return curves # Define the text to explode text = \"Hello Grasshopper!\" # Call the function to explode the text result = explode_text(text) # Print the result if result: for curve in result: print(\"Exploded curve:\", curve) else: print(\"No curves were created.\") ``` Make sure you have the Rhinoscript library (`rhinoscriptsyntax`) installed in your Grasshopper Python environment. In this script, the `explode_text` function takes a text string as input and uses the `rs.ExplodeText` function from the Aaron: i want to wrrite a compoent in python for grasshoper that explodes text ChatGPT: Certainly! Here's an example of a Python component for Grasshopper that explodes text: ```python import rhinoscriptsyntax as rs import Grasshopper.Kernel.Parameters as gkparams import Grasshopper.Kernel.Types as gtktypes class ExplodeTextComponent(GhPython.Assemblies.PythonComponent): def __new__(cls): instance = super(ExplodeTextComponent, cls).__new__(cls) instance.curve_output = gkparams.Param_Curve() return instance def get_ComponentGuid(self): return System.Guid(\"GUID-HERE\") def get_ComponentName(self): return \"Explode Text\" def get_ComponentIcon(self): return None def register_InputParams(self, pManager): pManager.AddTextParameter(\"Text\", \"T\", \"Text to explode\", gkparams.Param_Required) def register_OutputParams(self, pManager): pManager.AddParameter(self.curve_output, \"Curves\", \"C\", \"Exploded curves\", gkparams.Param_Output) def solve_Instance(self, DA): text = DA.GetData(0) curves = self.explode_text(text) DA.SetDataList(0, curves) def explode_text(self, text): curves = [] 
objects = rs.ExplodeText(text) if obje",
"category": "early_conversation"
},
{
"name": "Bosch Ind Study SP2022 Syllabus.docx",
"content": "Independent Study Industrial Design Client Branding ARS - 495 3 Credit Hours Spring 2022 \u2013 M \u2013 R 3:15pm \u2013 5:00pm Aaron Nelson EIH 103 \u2013 HVAMC Office Hours: M-R 10am \u2013 12pm nelsona@newpaltz.edu This Independent Study will be focused on 3D design branding for visual identity. The student will combine all the skills gained through the DDF program to build a project from the inspiration phase to the completed branding product. Tools and skills used during the project will include use of additive and subtractive fabrication equipment, and assessment of materials. Student Learning Outcomes: Student will synthesize and apply skills that have been built from DDF course to create a completed project with finished parts. Students who participate in the project will: Demonstrate their understanding of the basic concepts, vocabulary, and semantics of design. Identify and analyze for correct materials choice and usage Utilize manufacturing knowledge to generate cost/pricing evaluations for market determination Develop and understanding of client\u2019s visual language and synthesize into compelling 3D design Texts Simplified Complexity by Giancarlo De Marco Algorithms Aided Design by Arturo Tedeschi Grading & Attendance Students will be graded based on the rubric below: Homework will be due at the end of the next class period, unless otherwise specified. Due dates will be given in class. To be considered complete homework must be submitted as follows: Completed files uploaded to Blackboard BEFORE the start of class in which it is due. Files must be your own work. Uploading files sourced whole from the internet or created by anyone other than you will be reported as an academic integrity violation. Files should be formatted as instructed, deviations from naming and format will be counted as points off. Files should include comments and attributions of borrowed code for full points. 
Late assignments result in 10% deducted per day the assignment is late (ex. if you submitted one day lat",
"category": "document"
},
{
"name": "Mod02_Industries_and_Applications.pdf",
"content": "INDUSTRIES & APPLICATIONS USING STRATASYS 3D PRINTING Module 2 2 STRATASYS / A GLOBAL LEADER IN APPLIED ADDITIVE TECHNOLOGY SOLUTIONS Learning Objectives: \u2022 Identify 6 key industries leading in additive manufacturing using Stratasys technology. \u2022 Define the benefits and use cases for rapid prototyping, tooling and production parts. \u2022 Identify key FDM printing applications, their value and companies benefiting in the Aerospace Industry, Automotive Industry, Consumer Goods business, Healthcare: Medical and Dental \u2022 Describe the types of tools, printing solution and needs that 3D printing aides and tools have to general manufacturing (fabrication and assembly, health and safety, quality control, workflow) across industries. \u2022 Define Digital Manufacturing applications of Sandcasting, jigs and fixtures, EOAT, blow molding, End-use-parts, Silicone molding, Injection Molding and liquid silicone rubber parts. MODULE 2: INDUSTRIES & APPLICATIONS USING STRATASYS 3 STRATASYS / A GLOBAL LEADER IN APPLIED ADDITIVE TECHNOLOGY SOLUTIONS Additional Resources: \u2022 Embedded Videos \u2022 ACTIVITY Tooling Mini Challenge MODULE 3: INDUSTRIES & APPLICATIONS USING STRATASYS 4 STRATASYS / A GLOBAL LEADER IN APPLIED ADDITIVE TECHNOLOGY SOLUTIONS WHAT MATTERS MORE? 
OR 5 STRATASYS / A GLOBAL LEADER IN APPLIED ADDITIVE TECHNOLOGY SOLUTIONS https://www.youtube.com/watch?v=K11MWto9tKk INDUSTRIES & APPLICATIONS SHAPING OUR WORLD 6 STRATASYS / A GLOBAL LEADER IN APPLIED ADDITIVE TECHNOLOGY SOLUTIONS INDUSTRIES ManufacturingAerospace Automotive Consumer Products Healthcare 7 AEROSPACE 8 STRATASYS / A GLOBAL LEADER IN APPLIED ADDITIVE TECHNOLOGY SOLUTIONS SHAPING MANUFACTURING & SUPPLY CHAIN IN AEROSPACE APPLICATION VALUE WHO \u2022 Wind tunnel testing \u2022 Carbon fiber layups \u2022 Surrogate parts \u2022 Tooling for composite aero-structures \u2022 Production parts including interior cabin parts \u2022 Lighter weight parts \u2022 Advanced functionality \u2022 Improved fly-to-buy ratio \u2022 Increased supply chain efficiency with reduced invento",
"category": "document"
},
{
"name": "CAA Workshop.pdf",
"content": "2019 CAA Conference Workshop Parametric Modelling with Rhino and Grasshopper Twisted Vase Demo Aaron Nelson Assistant Professor \u2013 Digital Design and Fabrication Hudson Valley Advanced Manufacturing Center State University of New York at New Paltz Rhino \u2022 Rhino is a surface modeler based on NURBS. \u2022 NURBS is a spline implementation that is the mathematical basis of the NURBS curve (Control Point Curve) \u2022 Curves can be used to describe the edges of Surfaces \u2022 Surfaces are infinitely thin 2D boundary areas \u2022 Multiple surfaces joined at their edges that define a closed 3D region is a Solid Grasshopper and Parametric Models \u2022 Grasshopper is a visual programming language allowing the development of parametric models driven by algorithm within the Rhino environment \u2022 We can perform operations on Rhino geometries referencing them into Grasshopper and passing them into components \u2022 Parametric Models are designs that are driven by input parameters. Adjusting these input parameters changes the geometry of the model as generated by algorithm. 
This gives us the ability to quickly iterate through form and geometry (form finding) Grasshopper Components Input Components Pass information to other components Container Components Contains Data (usually geometry) Standard Components Performs an operation on some input data Grasshopper Components and Data Flow data flow \u2013 left to right (no looping structures) number slider - input curve container red components are in error state yellow components are in alert state a single line indicates one piece of data passed double lines indicates multiple pieces of data passed dashed lines indicates multiple pieces of data passed as a data tree data from one component can be passed into multiple components The Topology of the Vase The vase is assembled from 5 surfaces \u2022 Surfaces are infinitely thin 2D boundaries \u2022 Surfaces do not exist in 3D space \u2022 To 3D Print the vase we need to define a closed 3D region (Closed Solid) with the surfaces The Top",
"category": "document"
},
{
"name": "GRAD_ARF_24-25.pdf",
"content": "Graduate, Professional & Interdisciplinary Studies Graduate or Teaching Assistantship Appointment Request Form (ARF) Applicant Information Banner ID: N Sal. Name Major Code____________ GPA _______ Last First MI Address Street City State Zip Telephone New Paltz Email Home or Mobile Position Hiring Dept Residency ___ In-State ___ Out-of-State ___ Foreign (International students must fill out the information below regarding work eligibility) Country of Citizenship: _______________________ VISA Type:_______ _ Do you have the legal right to accept employment in the United States? ___ Yes __ No Proof of identity AND either US Citizenship or employment authorization are required prior to employment. Hiring Timeline Be attentive to your New Paltz email account! Communication regarding your hiring process will be conducted via NP email. More detailed information about the hiring process can be found on the Grad Studies website here. Attestation I hereby authorize investigation of all statement in this application and attached data as provided. I certify that such docu mentation is true and understand that misrepresentation or omission of facts called for in this form may be cause for refusal of employment or termination if I am offered a position. I understand that all graduate employees must satisfy the application and reappointment criteria listed in the Teaching Assistant and Graduate Assistant section of the Graduate, Professional & Interdisciplinary Studies website. If offered a position, I will submit e vidence of matriculation, cumulative GPA of at least 3.00 without grades of F or Incomplete, and registration for at least 6 graduate credits. Additionally, I understand that I must complete all hiring paperwork within 3 days of employment. 
Signature ________________________________________________________ Date _______________________________________________________ SUNY New Paltz does not discriminate on the basis of race, color, religion, sex, age, national origin, di",
"category": "document"
},
{
"name": "Thesis Request.pdf",
"content": "8/11 \u2022 38-007 S TAT E U N I V E R S I T Y O F N E W YO R K GRADUATE THESIS REQUEST Office of Records & Registration, SUNY New Paltz, 500 Hawk Drive, New Paltz, NY 12561-2439 F all Spring Summer I Summer II 20_____ Please type or print: __________________________________________________________________ Last Name First MI Student ID Number __________________________________________________________________ _____________________________________ Local Address: Street Apt. No. E-mail __________________________________________________________________ (________) ___________________________ City State Zip Code T elephone Number Course Number ______________________ Abbreviated title of study ____________________________ Credits* _________ * Credits vary from one program to another and students should check the College catalog for credit permitted. Please PRINT Instructor\u2019s name _____________________________________ Instructor ID Number RECOMMENDED BY: _____________________________________________________ ________________________________________________ Signature of Student Date Signature of Instructor Date _____________________________________________________ Signature of Department Chair Date AFFIRMATION OF CHARGES: I, ____________________________________________________________ , have read SUNY New Paltz\u2019s graduate student Continued Registration policy (http://www.newpaltz.edu/graduate/cont_reg_affirmation_of_charges_form_final_11.9.10.pdf). I understand that: \u2022 if I register for Comprehensive Exam Preparation (XXX599) and fail to complete the exam at the end of the semester, or \u2022 if I receive an H grade in a Thesis course, but I\u2019ve completed my other course work required for my degree, I will be automati - cally registered for one credit of Continued Registration each fall and spring semester until I either submit my thesis for a grade or pass the comprehensive exam for my graduate program. 
Furthermore, I understand that I am responsible for payment of Continued Registrati",
"category": "document"
}
]
}
@@ -0,0 +1,857 @@
{
"generated_at": "2026-05-03T23:47:54.802182+00:00",
"section_1": {
"overall": {
"total": 14069,
"type_null": 9815,
"ca_null": 12109,
"both_null": 9815,
"both_set": 1960
},
"cohorts": [
{
"type": "aaronai_conversation",
"ca_null": false,
"n": 71
},
{
"type": "chatgpt_conversation",
"ca_null": true,
"n": 1548
},
{
"type": "claude_conversation",
"ca_null": false,
"n": 1074
},
{
"type": "claude_memory",
"ca_null": true,
"n": 1
},
{
"type": "document",
"ca_null": false,
"n": 815
},
{
"type": "document",
"ca_null": true,
"n": 745
},
{
"type": null,
"ca_null": true,
"n": 9815
}
]
},
"section_2": {
"by_ext": [
{
"ext": ".pdf",
"rows": 6886
},
{
"ext": ".txt",
"rows": 1501
},
{
"ext": ".docx",
"rows": 1048
},
{
"ext": ".pptx",
"rows": 353
},
{
"ext": ".md",
"rows": 27
}
],
"classified": 9815,
"unclassifiable": 0
},
"section_3": {
"watcher_state_paths": 1462,
"watcher_state_basenames": 1183,
"watcher_state_collisions": 109,
"rows_with_filepath": {
"total": 9816,
"exists": 9649,
"missing": 167,
"outside_root": 0,
"sample": [
{
"id": "f317f238_0",
"source": "NO thesis proposal.docx",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF790 Thesis/Nic OConnor/NO thesis proposal.docx",
"mtime": "2024-01-26T15:06:09Z"
},
{
"id": "81047646_0",
"source": "Metals II Syllabus.pdf",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Professional/Job Applications/Job Apps Fall 2015/App State/Metals II Syllabus.pdf",
"mtime": "2012-02-26T22:45:15Z"
},
{
"id": "81047646_1",
"source": "Metals II Syllabus.pdf",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Professional/Job Applications/Job Apps Fall 2015/App State/Metals II Syllabus.pdf",
"mtime": "2012-02-26T22:45:15Z"
},
{
"id": "4e49d3b4_4",
"source": "Circuit Intro.pdf",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF310 Mechatronics/Week 1/Circuit Intro.pdf",
"mtime": "2022-01-31T23:28:56Z"
},
{
"id": "81047646_2",
"source": "Metals II Syllabus.pdf",
"filepath": "/home/aaron/nextcloud/data/data/aaron/files/Professional/Job Applications/Job Apps Fall 2015/App State/Metals II Syllabus.pdf",
"mtime": "2012-02-26T22:45:15Z"
}
]
},
"rows_without_filepath": {
"total": 744,
"distinct_basenames": 228,
"unique_hit": 211,
"collision_hit": 16,
"unfound": 1
},
"collision_shapes": {
"total": 109,
"shape_counts": {
"multi-live": 95,
"live+archive": 14
},
"rows_affected_by_shape": {
"multi-live": 85,
"live+archive": 0
},
"samples": {
"multi-live": [
{
"name": "README.md",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/README.md",
"mtime": "2026-04-25T17:08:01Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Processing/Nature of Code/The-Nature-of-Code-Examples/The-Nature-of-Code-Examples-master/README.md",
"mtime": "2017-03-09T23:32:59Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/samples/hal/README.md",
"mtime": "2016-12-21T10:37:05Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/platforms/maven/README.md",
"mtime": "2016-12-21T10:37:05Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/README.md",
"mtime": "2016-12-21T10:37:03Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/3rdparty/openvx/README.md",
"mtime": "2016-12-21T10:37:03Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/3rdparty/openvx/hal/README.md",
"mtime": "2016-12-21T10:37:03Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Code/Python/open CV/opencv/sources/3rdparty/carotene/README.md",
"mtime": "2016-12-21T10:37:02Z"
}
]
},
{
"name": "3DPrinting_v2.pptx",
"rows_no_fp_using_this_name": 4,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Presentations/Invited/Innovation Center/3DPrinting_v2.pptx",
"mtime": "2026-04-24T19:34:49Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Presentations/Invited/Cuba/Assets/3DPrinting_v2.pptx",
"mtime": "2026-04-24T19:34:18Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Presentations/Conference/3D Printing/3DPrinting_v2.pptx",
"mtime": "2026-04-24T19:34:15Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Workshops/3DPrinting_v2.pptx",
"mtime": "2026-04-24T19:30:14Z"
}
]
},
{
"name": "Print in Place.docx",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Academic/DDF205 CAD1/Print in Place.docx",
"mtime": "2017-08-24T03:50:36Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Academic/ARS393 CVS1/Print in Place.docx",
"mtime": "2015-10-28T20:36:52Z"
}
]
}
],
"live+archive": [
{
"name": "dreamer-design-spec.md",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Journal/dreamer-design-spec.md",
"mtime": "2026-04-25T22:55:11Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Archive/dreamer-design-spec.md",
"mtime": "2026-04-25T22:55:11Z"
}
]
},
{
"name": "BirdAI-Ingest-Architecture.md",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Journal/BirdAI-Ingest-Architecture.md",
"mtime": "2026-04-28T00:08:38Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Archive/BirdAI-Ingest-Architecture.md",
"mtime": "2026-04-28T00:08:38Z"
}
]
},
{
"name": "graphiti-migration-plan.md",
"rows_no_fp_using_this_name": 0,
"candidates": [
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Journal/graphiti-migration-plan.md",
"mtime": "2026-04-27T17:54:40Z"
},
{
"path": "/home/aaron/nextcloud/data/data/aaron/files/Archive/Migration Plans/graphiti-migration-plan.md",
"mtime": "2026-04-27T17:54:40Z"
}
]
}
]
}
}
},
"section_4": {
"export_dir_exists": true,
"files": [
{
"name": "conversations-000.json",
"size": 19050556,
"mtime": "2026-04-24T19:55:44Z"
},
{
"name": "conversations-001.json",
"size": 29057594,
"mtime": "2026-04-24T19:55:44Z"
}
],
"convo_index_size": 169,
"sample_results": [
{
"id": "chatgpt_87cc0c47-aaf9-42da-8169-3b8922f3afba_0",
"source": "ChatGPT: Dog named Bird",
"convo_id": "87cc0c47-aaf9-42da-8169-3b8922f3afba",
"create_time": 1708835138.51948,
"create_time_iso": "2024-02-25T04:25:38.519480Z",
"resolved": true
},
{
"id": "chatgpt_689fab3e-d79c-8333-aeb5-7da4e9ca160d_0",
"source": "ChatGPT: Video understanding limitations",
"convo_id": "689fab3e-d79c-8333-aeb5-7da4e9ca160d",
"create_time": 1755294541.894811,
"create_time_iso": "2025-08-15T21:49:01.894811Z",
"resolved": true
},
{
"id": "chatgpt_611ff391-7fc0-42ea-bfd9-18dbe1739f19_7",
"source": "ChatGPT: Calculating Truncated Cone Angle",
"convo_id": "611ff391-7fc0-42ea-bfd9-18dbe1739f19",
"create_time": 1724020869.471264,
"create_time_iso": "2024-08-18T22:41:09.471264Z",
"resolved": true
},
{
"id": "chatgpt_68ce1921-084c-8330-877c-78df1e03e54c_50",
"source": "ChatGPT: Soul music playlist ideas",
"convo_id": "68ce1921-084c-8330-877c-78df1e03e54c",
"create_time": 1758337313.438344,
"create_time_iso": "2025-09-20T03:01:53.438344Z",
"resolved": true
},
{
"id": "chatgpt_c02e94f0-17db-4fd9-be04-13aaa1b728cb_1",
"source": "ChatGPT: Create Rhino plugin in Python",
"convo_id": "c02e94f0-17db-4fd9-be04-13aaa1b728cb",
"create_time": 1682716259.557353,
"create_time_iso": "2023-04-28T21:10:59.557353Z",
"resolved": true
}
],
"sample_resolved": 5,
"full_cohort": {
"distinct_convo_ids": 168,
"resolvable_from_export": 168,
"unresolvable": 0
}
},
"section_5": {
"earliest_per_type": [
{
"type": "aaronai_conversation",
"earliest": "2026-04-26T17:43:28.056503",
"latest": "2026-05-03T01:45:21.469613",
"rows": 71
},
{
"type": "claude_conversation",
"earliest": "2026-02-28T20:33:36.146998Z",
"latest": "2026-04-23T04:26:00.015419Z",
"rows": 1074
},
{
"type": "document",
"earliest": "2026-04-30 16:42:55.360736+00",
"latest": "2026-05-03 20:14:33.13663+00",
"rows": 815
}
],
"git_findings": [
"037d7475738352dd13620486b5154d58fa6c037b 2026-04-28 00:15:46 +0000 chore: archive deprecated chromadb and migration scripts",
"67766371789276ec4bcb8bac271b6eb9ddafa888 2026-04-27 05:16:37 +0000 Remove hardcoded PG password fallbacks \u2014 require PG_DSN env var in all scripts",
"f78b83042bf2bb3d95c3604ee5d4431e76b103df 2026-04-26 21:16:04 +0000 Migrate to pgvector \u2014 remove ChromaDB from api.py, ingest scripts, dream.py",
"8c8fba11b8d1b359b9b7722fc19b6ef562b812d8 2026-04-26 21:28:40 +0000 Add nightly conversation indexing \u2014 Aaron AI conversations into pgvector at 2:30AM",
"f78b83042bf2bb3d95c3604ee5d4431e76b103df 2026-04-26 21:16:04 +0000 Migrate to pgvector \u2014 remove ChromaDB from api.py, ingest scripts, dream.py",
"d2eed9890665a78a37fb5d336e8af75e7f2acb42 2026-04-26 20:19:49 +0000 Pre-pgvector migration checkpoint \u2014 upsert, allow_replace_deleted, maintenance timer"
],
"chromadb_candidates": [],
"proposed_sentinel": "2026-04-26T00:00:00Z",
"reasoning": "git f78b830 'Migrate to pgvector \u2014 remove ChromaDB from api.py, ingest scripts, dream.py' is dated 2026-04-26. The earliest type='document' row with a non-NULL created_at lands 2026-04-30 (the F11 canonical-encoding cutover). Rows with NULL created_at all predate F11 and most predate the pgvector cutover itself. 2026-04-26 is the date the ChromaDB->pgvector migration script was committed, so any row currently in the embeddings table with NULL created_at must have been ingested on or after that date (when the table came into existence in current form). It is the tightest defensible upper bound on 'the row entered pgvector before timestamps were tracked', so it is the right sentinel."
},
"section_6": [
{
"cohort": "A (type NULL, ca NULL)",
"id": "f66c7390_6",
"source": "Design Guide - FDM for Composite Tooling 2.0.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2023-08-24T18:17:01Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "9cf798f8_151",
"source": "Shop Class as Soulcraft An inquiry into the value of the -- Crawford, Matthew.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-30T21:17:40.708026Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "fc378df0_329",
"source": "ulysses.txt",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2017-10-12T14:20:59Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "812bd5c6_0",
"source": "Bennington College Cover Letter.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2013-03-29T20:32:23Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "91ccefdd_185",
"source": "Cognition in the Wild (A Bradford Book) -- Hutchins, Edwin.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2026-04-25T17:21:35Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "48fa3d53_2",
"source": "CMakeLists.txt",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2016-12-21T10:37:05Z",
"inferred_ca_source": "filepath_stat"
},
{
"cohort": "A (type NULL, ca NULL)",
"id": "49e3545d_9",
"source": "RH50-TM-L1-EN-20140902.pdf",
"existing_type": null,
"existing_ca": null,
"inferred_type": "document",
"inferred_ca": "2014-09-02T18:44:08Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "A (type NULL, ca NULL)",
|
||||
"id": "a8366d89_144",
|
||||
"source": "Hackers and Painters_ Big Ideas from the Computer Age -- Graham, Paul.pdf",
|
||||
"existing_type": null,
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-24T22:25:03Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "A (type NULL, ca NULL)",
|
||||
"id": "3e3097f8_46",
|
||||
"source": "The Nature and Art of Workmanship -- David Pye.pdf",
|
||||
"existing_type": null,
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-24T22:24:03Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "A (type NULL, ca NULL)",
|
||||
"id": "87f9a5cf_269",
|
||||
"source": "Supersizing the Mind_ Embodiment, Action, and Cognitive -- Andy Clark.pdf",
|
||||
"existing_type": null,
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-25T17:14:25Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "cd3d1914_61",
|
||||
"source": "The world beyond your head _ on becoming an individual in an -- Crawford, Matthew B.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-27T16:04:25Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "592a1366_0",
|
||||
"source": "2026-04-29-synthesis.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-29T08:00:57.634567Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "cfb0a691_3",
|
||||
"source": "Consolidator-0.1-Specification.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-29T03:34:31Z",
|
||||
"inferred_ca_source": "watcher_state_unique"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "cd3d1914_57",
|
||||
"source": "The world beyond your head _ on becoming an individual in an -- Crawford, Matthew B.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-27T16:04:25Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "e65ef61c_8",
|
||||
"source": "BirdAI-Research-Context.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-29T15:57:07Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "4dce2922_3",
|
||||
"source": "cascade-optimization-protocol.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-28T05:46:24Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "077cc52d_1",
|
||||
"source": "graphiti-migration-plan.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-27T17:54:40Z",
|
||||
"inferred_ca_source": "watcher_state_collision_pick_latest_of_2"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "db356b14_70",
|
||||
"source": "Finite and infinite games -- James Carse.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-27T06:11:55Z",
|
||||
"inferred_ca_source": "watcher_state_collision_pick_latest_of_2"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "1f15bccf_38",
|
||||
"source": "BirdAI-Experiments-Log.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-05-01T16:40:02Z",
|
||||
"inferred_ca_source": "filepath_stat"
|
||||
},
|
||||
{
|
||||
"cohort": "B-doc-old (type='document', ca NULL)",
|
||||
"id": "db356b14_13",
|
||||
"source": "Finite and infinite games -- James Carse.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-27T06:11:55Z",
|
||||
"inferred_ca_source": "watcher_state_collision_pick_latest_of_2"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_68fd20c6-d838-832d-90f4-154f63281f49_30",
|
||||
"source": "ChatGPT: External review for tenure",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_691d6420-f544-8329-ae4b-f2b78da44c0e_7",
|
||||
"source": "ChatGPT: Website styling changes",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_67fc4254-ef50-8009-9e0f-81864cca7cec_1",
|
||||
"source": "ChatGPT: Job Application Review",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_68f3d936-d74c-8329-91df-fe838e292170_5",
|
||||
"source": "ChatGPT: SEC coaches with OSU ties",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_691d1b5b-bb4c-832b-8d2e-11a86a569fcc_4",
|
||||
"source": "ChatGPT: Hosting app platforms",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_bfa1cd2f-b8ab-4b11-b844-c47b2fa70612_1",
|
||||
"source": "ChatGPT: New chat",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_68ce1921-084c-8330-877c-78df1e03e54c_37",
|
||||
"source": "ChatGPT: Soul music playlist ideas",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_68fd20c6-d838-832d-90f4-154f63281f49_10",
|
||||
"source": "ChatGPT: External review for tenure",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_691d6420-f544-8329-ae4b-f2b78da44c0e_10",
|
||||
"source": "ChatGPT: Website styling changes",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "B-chatgpt (type='chatgpt_conversation', ca NULL)",
|
||||
"id": "chatgpt_690286bd-0758-8332-8491-5d00c77f4696_1",
|
||||
"source": "ChatGPT: Airbrushing and finishing setup",
|
||||
"existing_type": "chatgpt_conversation",
|
||||
"existing_ca": null,
|
||||
"inferred_type": "chatgpt_conversation",
|
||||
"inferred_ca": "2026-04-26T00:00:00Z",
|
||||
"inferred_ca_source": "sentinel"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "6ef0e329_0",
|
||||
"source": "schematic-substrate-analysis.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-05-01 16:42:13.360795+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-05-01 16:42:13.360795+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "02db1224_208",
|
||||
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "ead32317_93",
|
||||
"source": "Richard Sennett - The Craftsman.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-04-30 22:23:34.012202+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-30 22:23:34.012202+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "6ef0e329_4",
|
||||
"source": "schematic-substrate-analysis.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-05-01 16:42:13.360795+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-05-01 16:42:13.360795+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "02db1224_175",
|
||||
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "02db1224_101",
|
||||
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "02db1224_268",
|
||||
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "6ef0e329_5",
|
||||
"source": "schematic-substrate-analysis.md",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-05-01 16:42:13.360795+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-05-01 16:42:13.360795+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "ead32317_132",
|
||||
"source": "Richard Sennett - The Craftsman.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-04-30 22:23:34.012202+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-30 22:23:34.012202+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-doc-new (type='document', ca set)",
|
||||
"id": "02db1224_86",
|
||||
"source": "How Buildings Learn What Happens After They are Built -- Stewart Brand.pdf",
|
||||
"existing_type": "document",
|
||||
"existing_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_type": "document",
|
||||
"inferred_ca": "2026-04-30 22:21:56.211381+00",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-claude (type='claude_conversation', ca set)",
|
||||
"id": "claude_dacf89e3-1ee7-400d-8461-ef5920c82fe3_96",
|
||||
"source": "Claude: University of Utah interview teaching example",
|
||||
"existing_type": "claude_conversation",
|
||||
"existing_ca": "2026-03-11T18:05:57.594832Z",
|
||||
"inferred_type": "claude_conversation",
|
||||
"inferred_ca": "2026-03-11T18:05:57.594832Z",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-claude (type='claude_conversation', ca set)",
|
||||
"id": "claude_c0baf4b0-a7bb-4664-ac7b-98d7b02f56a6_26",
|
||||
"source": "Claude: Weighing Utah versus Oklahoma",
|
||||
"existing_type": "claude_conversation",
|
||||
"existing_ca": "2026-04-01T19:08:26.722197Z",
|
||||
"inferred_type": "claude_conversation",
|
||||
"inferred_ca": "2026-04-01T19:08:26.722197Z",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-claude (type='claude_conversation', ca set)",
|
||||
"id": "claude_bbe0172d-3087-4238-a51c-7dca6c0b6f28_92",
|
||||
"source": "Claude: Setting up a custom OpenClaw instance",
|
||||
"existing_type": "claude_conversation",
|
||||
"existing_ca": "2026-04-23T04:26:00.015419Z",
|
||||
"inferred_type": "claude_conversation",
|
||||
"inferred_ca": "2026-04-23T04:26:00.015419Z",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-claude (type='claude_conversation', ca set)",
|
||||
"id": "claude_42dbddc5-12ba-4de7-a685-043473189da9_6",
|
||||
"source": "Claude: I filling out my annual report...",
|
||||
"existing_type": "claude_conversation",
|
||||
"existing_ca": "2026-03-24T14:34:47.870625Z",
|
||||
"inferred_type": "claude_conversation",
|
||||
"inferred_ca": "2026-03-24T14:34:47.870625Z",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-claude (type='claude_conversation', ca set)",
|
||||
"id": "claude_bbe0172d-3087-4238-a51c-7dca6c0b6f28_1344",
|
||||
"source": "Claude: Setting up a custom OpenClaw instance",
|
||||
"existing_type": "claude_conversation",
|
||||
"existing_ca": "2026-04-23T04:26:00.015419Z",
|
||||
"inferred_type": "claude_conversation",
|
||||
"inferred_ca": "2026-04-23T04:26:00.015419Z",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
|
||||
"id": "aaronai_conv_28ee8a447d3fc922_6",
|
||||
"source": "Aaron AI: I'm working on you",
|
||||
"existing_type": "aaronai_conversation",
|
||||
"existing_ca": "2026-04-26T17:43:28.056503",
|
||||
"inferred_type": "aaronai_conversation",
|
||||
"inferred_ca": "2026-04-26T17:43:28.056503",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
|
||||
"id": "aaronai_conv_7deef2e8001f0e45_20",
|
||||
"source": "Aaron AI: Who's covering for me on sabbatical?",
|
||||
"existing_type": "aaronai_conversation",
|
||||
"existing_ca": "2026-04-29T22:19:45.312349",
|
||||
"inferred_type": "aaronai_conversation",
|
||||
"inferred_ca": "2026-04-29T22:19:45.312349",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
|
||||
"id": "aaronai_conv_21cabf771708df70_42",
|
||||
"source": "Aaron AI: What should I be the most excited about right now?",
|
||||
"existing_type": "aaronai_conversation",
|
||||
"existing_ca": "2026-04-27T07:06:03.996026",
|
||||
"inferred_type": "aaronai_conversation",
|
||||
"inferred_ca": "2026-04-27T07:06:03.996026",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
|
||||
"id": "aaronai_conv_7deef2e8001f0e45_12",
|
||||
"source": "Aaron AI: Who's covering for me on sabbatical?",
|
||||
"existing_type": "aaronai_conversation",
|
||||
"existing_ca": "2026-04-29T22:19:45.312349",
|
||||
"inferred_type": "aaronai_conversation",
|
||||
"inferred_ca": "2026-04-29T22:19:45.312349",
|
||||
"inferred_ca_source": "preserved"
|
||||
},
|
||||
{
|
||||
"cohort": "C-aaronai (type='aaronai_conversation', ca set)",
|
||||
"id": "aaronai_conv_ed40b4278a9c8110_4",
|
||||
"source": "Aaron AI: Let's say you're building an analog of the human brain, and ...",
|
||||
"existing_type": "aaronai_conversation",
|
||||
"existing_ca": "2026-05-03T01:45:21.469613",
|
||||
"inferred_type": "aaronai_conversation",
|
||||
"inferred_ca": "2026-05-03T01:45:21.469613",
|
||||
"inferred_ca_source": "preserved"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,987 @@
|
||||
{
|
||||
"generated_at": "2026-05-03T20:21:33.558462",
|
||||
"n_docs_with_frames": 668,
|
||||
"n_distinct_labels": 1374,
|
||||
"top_30_frames": [
|
||||
[
|
||||
"Education",
|
||||
238
|
||||
],
|
||||
[
|
||||
"Course",
|
||||
58
|
||||
],
|
||||
[
|
||||
"Programming",
|
||||
43
|
||||
],
|
||||
[
|
||||
"Design",
|
||||
32
|
||||
],
|
||||
[
|
||||
"Professional Experience",
|
||||
24
|
||||
],
|
||||
[
|
||||
"Employment",
|
||||
24
|
||||
],
|
||||
[
|
||||
"Research",
|
||||
23
|
||||
],
|
||||
[
|
||||
"3D Printing",
|
||||
22
|
||||
],
|
||||
[
|
||||
"Project",
|
||||
21
|
||||
],
|
||||
[
|
||||
"Grading",
|
||||
21
|
||||
],
|
||||
[
|
||||
"Art",
|
||||
21
|
||||
],
|
||||
[
|
||||
"Budget",
|
||||
21
|
||||
],
|
||||
[
|
||||
"Academic Integrity",
|
||||
20
|
||||
],
|
||||
[
|
||||
"Teaching",
|
||||
19
|
||||
],
|
||||
[
|
||||
"Technology",
|
||||
18
|
||||
],
|
||||
[
|
||||
"Attendance",
|
||||
17
|
||||
],
|
||||
[
|
||||
"Application",
|
||||
15
|
||||
],
|
||||
[
|
||||
"Accommodation",
|
||||
13
|
||||
],
|
||||
[
|
||||
"Manufacturing",
|
||||
13
|
||||
],
|
||||
[
|
||||
"Coursework",
|
||||
11
|
||||
],
|
||||
[
|
||||
"Recommendation",
|
||||
10
|
||||
],
|
||||
[
|
||||
"Manufacturing Process",
|
||||
10
|
||||
],
|
||||
[
|
||||
"Additive Manufacturing",
|
||||
10
|
||||
],
|
||||
[
|
||||
"Job Application",
|
||||
10
|
||||
],
|
||||
[
|
||||
"Exhibitions",
|
||||
10
|
||||
],
|
||||
[
|
||||
"Academic Administration",
|
||||
9
|
||||
],
|
||||
[
|
||||
"Communication",
|
||||
9
|
||||
],
|
||||
[
|
||||
"Course Design",
|
||||
9
|
||||
],
|
||||
[
|
||||
"Veteran and Military Services",
|
||||
9
|
||||
],
|
||||
[
|
||||
"Career",
|
||||
9
|
||||
]
|
||||
],
|
||||
"label_collisions": {
|
||||
"conversational": [
|
||||
[
|
||||
"Conversational",
|
||||
1
|
||||
],
|
||||
[
|
||||
"conversational",
|
||||
1
|
||||
]
|
||||
],
|
||||
"content": [
|
||||
[
|
||||
"Content",
|
||||
1
|
||||
],
|
||||
[
|
||||
"content",
|
||||
1
|
||||
]
|
||||
],
|
||||
"cascade": [
|
||||
[
|
||||
"Cascade",
|
||||
1
|
||||
],
|
||||
[
|
||||
"cascade",
|
||||
1
|
||||
]
|
||||
],
|
||||
"education": [
|
||||
[
|
||||
"Education",
|
||||
238
|
||||
],
|
||||
[
|
||||
"education",
|
||||
1
|
||||
]
|
||||
],
|
||||
"academic record": [
|
||||
[
|
||||
"Academic_Record",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Academic Record",
|
||||
1
|
||||
]
|
||||
],
|
||||
"independent study": [
|
||||
[
|
||||
"Independent Study",
|
||||
5
|
||||
],
|
||||
[
|
||||
"Independent_Study",
|
||||
2
|
||||
]
|
||||
],
|
||||
"project management": [
|
||||
[
|
||||
"Project Management",
|
||||
7
|
||||
],
|
||||
[
|
||||
"Project_Management",
|
||||
1
|
||||
]
|
||||
],
|
||||
"digital fabrication": [
|
||||
[
|
||||
"Digital Fabrication",
|
||||
6
|
||||
],
|
||||
[
|
||||
"digital_fabrication",
|
||||
1
|
||||
],
|
||||
[
|
||||
"digital fabrication",
|
||||
1
|
||||
]
|
||||
],
|
||||
"project proposal": [
|
||||
[
|
||||
"Project_Proposal",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Project Proposal",
|
||||
2
|
||||
]
|
||||
],
|
||||
"academic integrity": [
|
||||
[
|
||||
"Academic Integrity",
|
||||
20
|
||||
],
|
||||
[
|
||||
"Academic_Integrity",
|
||||
2
|
||||
]
|
||||
],
|
||||
"3d printing": [
|
||||
[
|
||||
"3D Printing",
|
||||
22
|
||||
],
|
||||
[
|
||||
"3D_Printing",
|
||||
7
|
||||
]
|
||||
],
|
||||
"technical skills": [
|
||||
[
|
||||
"Technical Skills",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Technical_Skills",
|
||||
1
|
||||
]
|
||||
],
|
||||
"course structure": [
|
||||
[
|
||||
"Course Structure",
|
||||
7
|
||||
],
|
||||
[
|
||||
"Course_Structure",
|
||||
1
|
||||
]
|
||||
],
|
||||
"course design": [
|
||||
[
|
||||
"Course Design",
|
||||
9
|
||||
],
|
||||
[
|
||||
"Course_Design",
|
||||
1
|
||||
]
|
||||
],
|
||||
"product design": [
|
||||
[
|
||||
"Product Design",
|
||||
6
|
||||
],
|
||||
[
|
||||
"Product_Design",
|
||||
1
|
||||
]
|
||||
],
|
||||
"professional experience": [
|
||||
[
|
||||
"Professional Experience",
|
||||
24
|
||||
],
|
||||
[
|
||||
"Professional_Experience",
|
||||
6
|
||||
]
|
||||
],
|
||||
"disability accommodations": [
|
||||
[
|
||||
"Disability Accommodations",
|
||||
4
|
||||
],
|
||||
[
|
||||
"Disability_Accommodations",
|
||||
1
|
||||
]
|
||||
],
|
||||
"material science": [
|
||||
[
|
||||
"Material_Science",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Material Science",
|
||||
4
|
||||
]
|
||||
],
|
||||
"computational design": [
|
||||
[
|
||||
"Computational Design",
|
||||
7
|
||||
],
|
||||
[
|
||||
"Computational_Design",
|
||||
1
|
||||
]
|
||||
],
|
||||
"computer services policy": [
|
||||
[
|
||||
"Computer Services Policy",
|
||||
6
|
||||
],
|
||||
[
|
||||
"Computer_Services_Policy",
|
||||
1
|
||||
]
|
||||
],
|
||||
"work experience": [
|
||||
[
|
||||
"Work_Experience",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Work Experience",
|
||||
3
|
||||
]
|
||||
],
|
||||
"academic program": [
|
||||
[
|
||||
"Academic Program",
|
||||
7
|
||||
],
|
||||
[
|
||||
"Academic_Program",
|
||||
1
|
||||
]
|
||||
],
|
||||
"project-based learning": [
|
||||
[
|
||||
"Project-Based Learning",
|
||||
5
|
||||
],
|
||||
[
|
||||
"Project-Based_Learning",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Project-based Learning",
|
||||
2
|
||||
]
|
||||
],
|
||||
"art and design": [
|
||||
[
|
||||
"Art and Design",
|
||||
6
|
||||
],
|
||||
[
|
||||
"Art_and_Design",
|
||||
1
|
||||
]
|
||||
],
|
||||
"fdm technology": [
|
||||
[
|
||||
"FDM_Technology",
|
||||
2
|
||||
],
|
||||
[
|
||||
"FDM Technology",
|
||||
1
|
||||
]
|
||||
],
|
||||
"material selection": [
|
||||
[
|
||||
"Material_Selection",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Material Selection",
|
||||
1
|
||||
]
|
||||
],
|
||||
"product development": [
|
||||
[
|
||||
"Product Development",
|
||||
6
|
||||
],
|
||||
[
|
||||
"Product_Development",
|
||||
2
|
||||
]
|
||||
],
|
||||
"market research": [
|
||||
[
|
||||
"Market_Research",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Market Research",
|
||||
2
|
||||
]
|
||||
],
|
||||
"computer services": [
|
||||
[
|
||||
"Computer Services",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Computer_Services",
|
||||
1
|
||||
]
|
||||
],
|
||||
"student evaluation of instruction": [
|
||||
[
|
||||
"Student Evaluation of Instruction",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Student_Evaluation_of_Instruction",
|
||||
1
|
||||
]
|
||||
],
|
||||
"course management": [
|
||||
[
|
||||
"Course_Management",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Course Management",
|
||||
1
|
||||
]
|
||||
],
|
||||
"grade policy": [
|
||||
[
|
||||
"Grade_Policy",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Grade Policy",
|
||||
1
|
||||
]
|
||||
],
|
||||
"academic transcript": [
|
||||
[
|
||||
"Academic_Transcript",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Academic Transcript",
|
||||
1
|
||||
]
|
||||
],
|
||||
"evaluation criteria": [
|
||||
[
|
||||
"Evaluation Criteria",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Evaluation_Criteria",
|
||||
1
|
||||
]
|
||||
],
|
||||
"computer science": [
|
||||
[
|
||||
"Computer Science",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Computer_Science",
|
||||
1
|
||||
]
|
||||
],
|
||||
"electrical circuit": [
|
||||
[
|
||||
"Electrical Circuit",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Electrical_Circuit",
|
||||
1
|
||||
]
|
||||
],
|
||||
"digital logic": [
|
||||
[
|
||||
"Digital Logic",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Digital_Logic",
|
||||
1
|
||||
]
|
||||
],
|
||||
"course description": [
|
||||
[
|
||||
"Course Description",
|
||||
3
|
||||
],
|
||||
[
|
||||
"Course_Description",
|
||||
1
|
||||
]
|
||||
],
|
||||
"organizational structure": [
|
||||
[
|
||||
"Organizational_Structure",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Organizational Structure",
|
||||
1
|
||||
]
|
||||
],
|
||||
"digital design": [
|
||||
[
|
||||
"Digital_Design",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Digital Design",
|
||||
4
|
||||
]
|
||||
],
|
||||
"contact information": [
|
||||
[
|
||||
"Contact Information",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Contact_Information",
|
||||
1
|
||||
]
|
||||
],
|
||||
"professional career": [
|
||||
[
|
||||
"Professional_Career",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Professional Career",
|
||||
1
|
||||
]
|
||||
],
|
||||
"personal projects": [
|
||||
[
|
||||
"Personal_Projects",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Personal Projects",
|
||||
2
|
||||
]
|
||||
],
|
||||
"ai development": [
|
||||
[
|
||||
"AI_Development",
|
||||
1
|
||||
],
|
||||
[
|
||||
"AI Development",
|
||||
1
|
||||
]
|
||||
],
|
||||
"university service": [
|
||||
[
|
||||
"University Service",
|
||||
2
|
||||
],
|
||||
[
|
||||
"University_Service",
|
||||
1
|
||||
]
|
||||
],
|
||||
"professional exhibitions and publications": [
|
||||
[
|
||||
"Professional Exhibitions and Publications",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Professional_Exhibitions_and_Publications",
|
||||
1
|
||||
]
|
||||
],
|
||||
"selected external consulting and design work": [
|
||||
[
|
||||
"Selected External Consulting and Design Work",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Selected_External_Consulting_and_Design_Work",
|
||||
2
|
||||
]
|
||||
],
|
||||
"academic career": [
|
||||
[
|
||||
"Academic_Career",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Academic Career",
|
||||
2
|
||||
]
|
||||
],
|
||||
"technology integration": [
|
||||
[
|
||||
"Technology Integration",
|
||||
2
|
||||
],
|
||||
[
|
||||
"Technology_Integration",
|
||||
1
|
||||
]
|
||||
],
|
||||
"artistic practice": [
|
||||
[
|
||||
"Artistic_Practice",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Artistic Practice",
|
||||
1
|
||||
]
|
||||
],
|
||||
"multi-material 3d printing": [
|
||||
[
|
||||
"Multi-Material 3D Printing",
|
||||
1
|
||||
],
|
||||
[
|
||||
"Multi-material 3D Printing",
|
||||
1
|
||||
]
|
||||
],
|
||||
"community engagement": [
|
||||
[
|
||||
"Community Engagement",
|
||||
3
|
||||
],
|
||||
[
|
||||
"Community_Engagement",
|
||||
1
|
||||
]
|
||||
],
|
||||
"digitaldesignandfabrication": [
|
||||
[
|
||||
"DigitalDesignAndFabrication",
|
||||
1
|
||||
],
|
||||
[
|
||||
"DigitalDesignandFabrication",
|
||||
1
|
||||
]
|
||||
],
|
||||
"professional background": [
|
||||
[
|
||||
"Professional Background",
|
||||
3
|
||||
],
|
||||
[
|
||||
"Professional_Background",
|
||||
1
|
||||
]
|
||||
]
|
||||
},
|
||||
"per_doc_frame_count": {
|
||||
"3": 282,
|
||||
"5": 67,
|
||||
"4": 195,
|
||||
"2": 57,
|
||||
"7": 13,
|
||||
"11": 5,
|
||||
"13": 2,
|
||||
"15": 1,
|
||||
"12": 4,
|
||||
"6": 21,
|
||||
"8": 8,
|
||||
"10": 4,
|
||||
"9": 6,
|
||||
"30": 1,
|
||||
"14": 1,
|
||||
"18": 1
|
||||
},
|
||||
"top_30_pairs": [
|
||||
{
|
||||
"a": "Course",
|
||||
"b": "Education",
|
||||
"count": 46
|
||||
},
|
||||
{
|
||||
"a": "Education",
|
||||
"b": "Project",
|
||||
"count": 20
|
||||
},
|
||||
{
|
||||
"a": "Design",
|
||||
"b": "Education",
|
||||
"count": 20
|
||||
},
|
||||
{
|
||||
"a": "Education",
|
||||
"b": "Professional Experience",
|
||||
"count": 20
|
||||
},
|
||||
{
|
||||
"a": "Education",
|
||||
"b": "Employment",
|
||||
"count": 20
|
||||
},
|
||||
{
|
||||
"a": "Education",
|
||||
"b": "Technology",
|
||||
"count": 18
|
||||
},
|
||||
{
|
||||
"a": "Education",
|
||||
"b": "Grading",
|
||||
"count": 17
|
||||
},
|
||||
{
|
||||
"a": "Education",
|
||||
"b": "Research",
|
||||
"count": 15
|
||||
},
|
||||
{
|
||||
"a": "Art",
|
||||
"b": "Education",
|
||||
"count": 15
|
||||
},
|
||||
{
|
||||
"a": "Attendance",
|
||||
"b": "Grading",
|
||||
"count": 14
|
||||
},
|
||||
{
|
||||
"a": "Course",
|
||||
"b": "Grading",
|
||||
"count": 13
|
||||
},
|
||||
{
|
||||
"a": "Academic Integrity",
|
||||
"b": "Education",
|
||||
"count": 11
|
||||
},
|
||||
{
|
||||
"a": "Attendance",
|
||||
"b": "Education",
|
||||
"count": 11
|
||||
},
|
||||
{
|
||||
"a": "Attendance",
|
||||
"b": "Course",
|
||||
"count": 11
|
||||
},
|
||||
{
|
||||
"a": "Application",
|
||||
"b": "Employment",
|
||||
"count": 11
|
||||
},
|
||||
{
|
||||
"a": "Coursework",
|
||||
"b": "Education",
|
||||
"count": 10
|
||||
},
|
||||
{
|
||||
"a": "Course",
|
||||
"b": "Design",
|
||||
"count": 10
|
||||
},
|
||||
{
|
||||
"a": "Course",
|
||||
"b": "Programming",
|
||||
"count": 10
|
||||
},
|
||||
{
|
||||
"a": "Application",
|
||||
"b": "Education",
|
||||
"count": 10
|
||||
},
|
||||
{
|
||||
"a": "Budget",
|
||||
"b": "Education",
|
||||
"count": 10
|
||||
},
|
||||
{
|
||||
"a": "Academic Integrity",
|
||||
"b": "Accommodation",
|
||||
"count": 9
|
||||
},
|
||||
{
|
||||
"a": "Education",
|
||||
"b": "Teaching",
|
||||
"count": 9
|
||||
},
|
||||
{
|
||||
"a": "Education",
|
||||
"b": "Programming",
|
||||
"count": 9
|
||||
},
|
||||
{
|
||||
"a": "Academic Integrity",
|
||||
"b": "Attendance",
|
||||
"count": 9
|
||||
},
|
||||
{
|
||||
"a": "Course",
|
||||
"b": "Project",
|
||||
"count": 8
|
||||
},
|
||||
{
|
||||
"a": "Research",
|
||||
"b": "Teaching",
|
||||
"count": 8
|
||||
},
|
||||
{
|
||||
"a": "Grading",
|
||||
"b": "Project",
|
||||
"count": 7
|
||||
},
|
||||
{
|
||||
"a": "Art",
|
||||
"b": "Technology",
|
||||
"count": 7
|
||||
},
|
||||
{
|
||||
"a": "Academic Integrity",
|
||||
"b": "Course",
|
||||
"count": 7
|
||||
},
|
||||
{
|
||||
"a": "Accommodation",
|
||||
"b": "Course",
|
||||
"count": 7
|
||||
}
|
||||
],
|
||||
"folder_crosstab": {
|
||||
"Education": {
|
||||
"pdf": 116,
|
||||
"docx": 119,
|
||||
"pptx": 3
|
||||
},
|
||||
"Course": {
|
||||
"pdf": 29,
|
||||
"docx": 29
|
||||
},
|
||||
"Programming": {
|
||||
"pptx": 15,
|
||||
"docx": 10,
|
||||
"pdf": 12,
|
||||
"txt": 6
|
||||
},
|
||||
"Design": {
|
||||
"pdf": 13,
|
||||
"docx": 16,
|
||||
"pptx": 3
|
||||
},
|
||||
"Professional Experience": {
|
||||
"docx": 13,
|
||||
"pdf": 11
|
||||
},
|
||||
"Employment": {
|
||||
"pdf": 15,
|
||||
"docx": 9
|
||||
},
|
||||
"Research": {
|
||||
"pdf": 9,
|
||||
"docx": 13,
|
||||
"markdown": 1
|
||||
},
|
||||
"3D Printing": {
|
||||
"docx": 3,
|
||||
"pdf": 11,
|
||||
"pptx": 8
|
||||
},
|
||||
"Project": {
|
||||
"pdf": 8,
|
||||
"docx": 12,
|
||||
"markdown": 1
|
||||
},
|
||||
"Grading": {
|
||||
"pdf": 10,
|
||||
"docx": 11
|
||||
},
|
||||
"Art": {
|
||||
"docx": 11,
|
||||
"pdf": 9,
|
||||
"pptx": 1
|
||||
},
|
||||
"Budget": {
|
||||
"docx": 6,
|
||||
"pdf": 15
|
||||
},
|
||||
"Academic Integrity": {
|
||||
"docx": 17,
|
||||
"pdf": 3
|
||||
},
|
||||
"Teaching": {
|
||||
"pdf": 9,
|
||||
"docx": 10
|
||||
},
|
||||
"Technology": {
|
||||
"docx": 15,
|
||||
"pdf": 3
|
||||
},
|
||||
"Attendance": {
|
||||
"docx": 11,
|
||||
"pdf": 6
|
||||
},
|
||||
"Application": {
|
||||
"pdf": 13,
|
||||
"docx": 2
|
||||
},
|
||||
"Accommodation": {
|
||||
"docx": 11,
|
||||
"pdf": 2
|
||||
},
|
||||
"Manufacturing": {
|
||||
"docx": 6,
|
||||
"pptx": 4,
|
||||
"pdf": 3
|
||||
},
|
||||
"Coursework": {
|
||||
"pdf": 8,
|
||||
"docx": 3
|
||||
}
|
||||
},
|
||||
"bin_totals": {
|
||||
"markdown": 64,
|
||||
"pdf": 286,
|
||||
"pptx": 70,
|
||||
"txt": 28,
|
||||
"docx": 217,
|
||||
"dream_output": 3
|
||||
},
|
||||
"worker_versions": {
|
||||
"2.0": 3,
|
||||
"2.1": 665
|
||||
},
|
||||
"data_gap": {
|
||||
"count": 339,
|
||||
"by_type_bin": {
|
||||
"pdf": 110,
|
||||
"voice_note": 14,
|
||||
"docx": 110,
|
||||
"dream_output": 39,
|
||||
"pptx": 31,
|
||||
"txt": 28,
|
||||
"markdown": 7
|
||||
},
|
||||
"char_length": {
|
||||
"min": 6,
|
||||
"max": 1998,
|
||||
"median": 1077
|
||||
},
|
||||
"sample_sources": [
|
||||
"Thesis Paper Guidlines.pdf",
|
||||
"2026-04-30-17-06-voice.md",
|
||||
"2026-04-30-15-59-voice.md",
|
||||
"2026-04-30-16-53-voice.md",
|
||||
"2026-04-30-16-23-voice.md",
|
||||
"2026-04-29-17-52-voice.md",
|
||||
"2026-04-30-16-59-voice.md",
|
||||
"Outline for 3D Printed Materials for Foundry Casting.docx",
|
||||
"2026-04-26-22-52-voice.md",
|
||||
"2026-04-30-synthesis.md"
|
||||
]
|
||||
},
|
||||
"corpus_coverage": {
|
||||
"total_distinct_sources_in_embeddings": 1255,
|
||||
"conversations_no_frames_by_design": 198,
|
||||
"files_with_frames": 704,
|
||||
"files_short_no_frames": 339,
|
||||
"files_stage2_failed": 12,
|
||||
"frame_coverage_pct": 56.1
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,177 @@
|
||||
{
|
||||
"sample_size": 50,
|
||||
"batch_size": 5,
|
||||
"n_batches": 10,
|
||||
"successful_batches": 4,
|
||||
"failed_batches": 6,
|
||||
"successful_episodes": 20,
|
||||
"failed_episodes": 30,
|
||||
"total_elapsed_s": 283.0,
|
||||
"mean_elapsed_per_episode_s": 5.29,
|
||||
"results": [
|
||||
{
|
||||
"batch_size": 5,
|
||||
"status_code": 200,
|
||||
"elapsed_s": 17.22,
|
||||
"elapsed_per_episode_s": 3.44,
|
||||
"response": {
|
||||
"ok": true,
|
||||
"count": 5
|
||||
},
|
||||
"error": null,
|
||||
"sources": [
|
||||
"ChatGPT: Create new playlist",
|
||||
"04_Annotations.docx",
|
||||
"Macro1.txt",
|
||||
"Interns Fall 22.txt",
|
||||
"Grasshopper Homework 1.pptx"
|
||||
]
|
||||
},
|
||||
{
|
||||
"batch_size": 5,
|
||||
"status_code": 200,
|
||||
"elapsed_s": 15.94,
|
||||
"elapsed_per_episode_s": 3.19,
|
||||
"response": {
|
||||
"ok": true,
|
||||
"count": 5
|
||||
},
|
||||
"error": null,
|
||||
"sources": [
|
||||
"Course Flow Chart.docx",
|
||||
"CertificationActivity_1.1.pdf",
|
||||
"Additional SLO for Graduate Independent Study.docx",
|
||||
"Sarah.pdf",
|
||||
"06_3D_Editing.docx"
|
||||
]
|
||||
},
|
||||
{
|
||||
"batch_size": 5,
|
||||
"status_code": 200,
|
||||
"elapsed_s": 47.24,
|
||||
"elapsed_per_episode_s": 9.45,
|
||||
"response": {
|
||||
"ok": true,
|
||||
"count": 5
|
||||
},
|
||||
"error": null,
|
||||
"sources": [
|
||||
"05_Making things solid.docx",
|
||||
"ANelsonUPS.pdf",
|
||||
"Grad Program Possibilities.docx",
|
||||
"Gregg Navarro Schematic -2019.pdf",
|
||||
"Feedback From Image Cutouts.txt"
|
||||
]
|
||||
},
|
||||
{
|
||||
"batch_size": 5,
|
||||
"status_code": 500,
|
||||
"elapsed_s": 16.68,
|
||||
"elapsed_per_episode_s": 3.34,
|
||||
"response": null,
|
||||
"error": "{\"detail\":\"Max pending queries exceeded\"}",
|
||||
"sources": [
|
||||
"SCAD Cover.docx",
|
||||
"Penland Studio Coordinator Cover Letter.docx",
|
||||
"References.docx",
|
||||
"List of Accomplishments 2020.pdf",
|
||||
"DDF posters.pdf"
|
||||
]
|
||||
},
|
||||
{
|
||||
"batch_size": 5,
|
||||
"status_code": 500,
|
||||
"elapsed_s": 53.29,
|
||||
"elapsed_per_episode_s": 10.66,
|
||||
"response": null,
|
||||
"error": "{\"detail\":\"Max pending queries exceeded\"}",
|
||||
"sources": [
|
||||
"CADI_Final Assignment_sample.docx",
|
||||
"FDM Systems and Materials Reference Matrix EN.pdf",
|
||||
"Mod2Quiz.docx",
|
||||
"ChatGPT: Justification for vinyl machine",
|
||||
"Voltage Divider.pptx"
|
||||
]
|
||||
},
|
||||
{
|
||||
"batch_size": 5,
|
||||
"status_code": 200,
|
||||
"elapsed_s": 25.31,
|
||||
"elapsed_per_episode_s": 5.06,
|
||||
"response": {
|
||||
"ok": true,
|
||||
"count": 5
|
||||
},
|
||||
"error": null,
|
||||
"sources": [
|
||||
"02_Point of Curves.docx",
|
||||
"Tent Poles.docx",
|
||||
"ViVOAtHome Spray Booth.pdf",
|
||||
"Alexander_Peraza_38-007 Grad Thesis Request_Spring24.pdf",
|
||||
"README.txt"
|
||||
]
|
||||
},
|
||||
{
|
||||
"batch_size": 5,
|
||||
"status_code": 500,
|
||||
"elapsed_s": 8.57,
|
||||
"elapsed_per_episode_s": 1.71,
|
||||
"response": null,
|
||||
"error": "{\"detail\":\"Max pending queries exceeded\"}",
|
||||
"sources": [
|
||||
"DDF Program Specific Critical Thinking - AARON.docx",
|
||||
"Course Calender.pdf",
|
||||
"ACAD MIN.docx",
|
||||
"CAA Workshop.pptx",
|
||||
"Thesis Paper Guidlines.docx"
|
||||
]
|
||||
},
|
||||
{
|
||||
"batch_size": 5,
|
||||
"status_code": 500,
|
||||
"elapsed_s": 25.79,
|
||||
"elapsed_per_episode_s": 5.16,
|
||||
"response": null,
|
||||
"error": "{\"detail\":\"Max pending queries exceeded\"}",
|
||||
"sources": [
|
||||
"Computational Media Week 1 Handout.docx",
|
||||
"Aaron_Nelson_CVupdate.docx",
|
||||
"Aaron Nelson - Artist Statement.pdf",
|
||||
"DDF305 Course Increase Fee.docx",
|
||||
"Aaron Nelson Art Resume.docx"
|
||||
]
|
||||
},
|
||||
{
|
||||
"batch_size": 5,
|
||||
"status_code": 500,
|
||||
"elapsed_s": 42.34,
|
||||
"elapsed_per_episode_s": 8.47,
|
||||
"response": null,
|
||||
"error": "{\"detail\":\"Max pending queries exceeded\"}",
|
||||
"sources": [
|
||||
"Dylan McManus Recommendation.docx",
|
||||
"Design Guide - FDM for Composite Tooling 2.0.pdf",
|
||||
"Claude: Art jewelry discourse and contemporary trends",
|
||||
"Aaron Nelson - CV.docx",
|
||||
"Lecture 2 Update.pptx"
|
||||
]
|
||||
},
|
||||
{
|
||||
"batch_size": 5,
|
||||
"status_code": 500,
|
||||
"elapsed_s": 30.6,
|
||||
"elapsed_per_episode_s": 6.12,
|
||||
"response": null,
|
||||
"error": "{\"detail\":\"Max pending queries exceeded\"}",
|
||||
"sources": [
|
||||
"readme.txt",
|
||||
"CADII_PushPullTwist_Final Assignment .docx",
|
||||
"Finn BIles Recommendation_Washington.pdf",
|
||||
"Advanced CAD Syllabus DDF701 V3.docx",
|
||||
"Senior Deisgn 2018.pdf"
|
||||
]
|
||||
}
|
||||
],
|
||||
"total_corpus_sources": 1166,
|
||||
"estimated_migration_hours": 1.7
|
||||
}
|
||||
@@ -0,0 +1,95 @@
|
||||
{
|
||||
"n_retry_sources": 30,
|
||||
"n_batches": 6,
|
||||
"successful_batches": 4,
|
||||
"failed_batches": 2,
|
||||
"successful_episodes": 20,
|
||||
"failed_episodes": 10,
|
||||
"total_elapsed_s": 408.5,
|
||||
"results": [
|
||||
{
|
||||
"batch_size": 5,
|
||||
"status_code": 200,
|
||||
"elapsed_s": 49.19,
|
||||
"elapsed_per_episode_s": 9.84,
|
||||
"error": null,
|
||||
"sources": [
|
||||
"SCAD Cover.docx",
|
||||
"Penland Studio Coordinator Cover Letter.docx",
|
||||
"References.docx",
|
||||
"List of Accomplishments 2020.pdf",
|
||||
"DDF posters.pdf"
|
||||
]
|
||||
},
|
||||
{
|
||||
"batch_size": 5,
|
||||
"status_code": 200,
|
||||
"elapsed_s": 84.16,
|
||||
"elapsed_per_episode_s": 16.83,
|
||||
"error": null,
|
||||
"sources": [
|
||||
"CADI_Final Assignment_sample.docx",
|
||||
"FDM Systems and Materials Reference Matrix EN.pdf",
|
||||
"Mod2Quiz.docx",
|
||||
"ChatGPT: Justification for vinyl machine",
|
||||
"Voltage Divider.pptx"
|
||||
]
|
||||
},
|
||||
{
|
||||
"batch_size": 5,
|
||||
"status_code": 200,
|
||||
"elapsed_s": 31.0,
|
||||
"elapsed_per_episode_s": 6.2,
|
||||
"error": null,
|
||||
"sources": [
|
||||
"DDF Program Specific Critical Thinking - AARON.docx",
|
||||
"Course Calender.pdf",
|
||||
"ACAD MIN.docx",
|
||||
"CAA Workshop.pptx",
|
||||
"Thesis Paper Guidlines.docx"
|
||||
]
|
||||
},
|
||||
{
|
||||
"batch_size": 5,
|
||||
"status_code": 500,
|
||||
"elapsed_s": 57.85,
|
||||
"elapsed_per_episode_s": 11.57,
|
||||
"error": "{\"detail\":\"Query timed out\"}",
|
||||
"sources": [
|
||||
"Computational Media Week 1 Handout.docx",
|
||||
"Aaron_Nelson_CVupdate.docx",
|
||||
"Aaron Nelson - Artist Statement.pdf",
|
||||
"DDF305 Course Increase Fee.docx",
|
||||
"Aaron Nelson Art Resume.docx"
|
||||
]
|
||||
},
|
||||
{
|
||||
"batch_size": 5,
|
||||
"status_code": 500,
|
||||
"elapsed_s": 66.15,
|
||||
"elapsed_per_episode_s": 13.23,
|
||||
"error": "{\"detail\":\"Query timed out\"}",
|
||||
"sources": [
|
||||
"Dylan McManus Recommendation.docx",
|
||||
"Design Guide - FDM for Composite Tooling 2.0.pdf",
|
||||
"Claude: Art jewelry discourse and contemporary trends",
|
||||
"Aaron Nelson - CV.docx",
|
||||
"Lecture 2 Update.pptx"
|
||||
]
|
||||
},
|
||||
{
|
||||
"batch_size": 5,
|
||||
"status_code": 200,
|
||||
"elapsed_s": 120.1,
|
||||
"elapsed_per_episode_s": 24.02,
|
||||
"error": null,
|
||||
"sources": [
|
||||
"readme.txt",
|
||||
"CADII_PushPullTwist_Final Assignment .docx",
|
||||
"Finn BIles Recommendation_Washington.pdf",
|
||||
"Advanced CAD Syllabus DDF701 V3.docx",
|
||||
"Senior Deisgn 2018.pdf"
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,51 @@
|
||||
{
|
||||
"n_sources": 10,
|
||||
"successful_batches": 4,
|
||||
"failed_batches": 0,
|
||||
"successful_episodes": 10,
|
||||
"failed_episodes": 0,
|
||||
"results": [
|
||||
{
|
||||
"batch_size": 3,
|
||||
"status_code": 200,
|
||||
"elapsed_s": 46.49,
|
||||
"error": null,
|
||||
"sources": [
|
||||
"Computational Media Week 1 Handout.docx",
|
||||
"Aaron_Nelson_CVupdate.docx",
|
||||
"Aaron Nelson - Artist Statement.pdf"
|
||||
]
|
||||
},
|
||||
{
|
||||
"batch_size": 3,
|
||||
"status_code": 200,
|
||||
"elapsed_s": 38.21,
|
||||
"error": null,
|
||||
"sources": [
|
||||
"DDF305 Course Increase Fee.docx",
|
||||
"Aaron Nelson Art Resume.docx",
|
||||
"Dylan McManus Recommendation.docx"
|
||||
]
|
||||
},
|
||||
{
|
||||
"batch_size": 3,
|
||||
"status_code": 200,
|
||||
"elapsed_s": 132.51,
|
||||
"error": null,
|
||||
"sources": [
|
||||
"Design Guide - FDM for Composite Tooling 2.0.pdf",
|
||||
"Claude: Art jewelry discourse and contemporary trends",
|
||||
"Aaron Nelson - CV.docx"
|
||||
]
|
||||
},
|
||||
{
|
||||
"batch_size": 1,
|
||||
"status_code": 200,
|
||||
"elapsed_s": 18.63,
|
||||
"error": null,
|
||||
"sources": [
|
||||
"Lecture 2 Update.pptx"
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
File diff suppressed because one or more lines are too long
File diff suppressed because it is too large
File diff suppressed because it is too large
File diff suppressed because it is too large
@@ -0,0 +1,4 @@
|
||||
# Local backups created by apply.sh — environment state, not source.
|
||||
# Keeping these out of version control prevents repo bloat and avoids
|
||||
# checking in graphiti-core's Apache-2.0 source under our repo's tree.
|
||||
backups/
|
||||
@@ -0,0 +1,58 @@
|
||||
# graphiti-core Patches — FalkorDB Vector Index Support
|
||||
|
||||
Vendored patches against graphiti-core 0.29.0 that add native FalkorDB
vector index support. Three files are modified: `graphiti_core/graph_queries.py`
and two files in the FalkorDB driver code (see Files below).
The Neo4j and Kuzu code paths are unchanged.
|
||||
|
||||
## Why this exists
|
||||
|
||||
graphiti-core's FalkorDB driver computes similarity with interpreted
Cypher cosine math (`vec.cosineDistance(...)`), so each query becomes a
full scan over Entity and Community nodes or RELATES_TO edges. At ~4,000+
entities, the resolve-against-existing-graph step of a single-episode
ingest takes 8+ minutes, and bulk ingest hangs FalkorDB. FalkorDB itself
provides the `db.idx.vector.queryNodes` and `db.idx.vector.queryRelationships`
procedures, backed by HNSW indexes, but graphiti-core's driver doesn't use
them.
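
To make the cost concrete, the interpreted path behaves roughly like the brute-force scan sketched below: every stored embedding is compared against the query vector before anything is ranked. This is a plain-Python illustration under that assumption, not graphiti-core or FalkorDB code; `cosine_similarity` and `full_scan_top_k` are hypothetical names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def full_scan_top_k(
    embeddings: dict[str, list[float]], query: list[float], k: int
) -> list[tuple[str, float]]:
    """Score every stored vector, then sort: N similarity computations per query."""
    scored = [(name, cosine_similarity(vec, query)) for name, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real ones have hundreds of dimensions.
    store = {
        "Education": [1.0, 0.0],
        "Course": [0.9, 0.1],
        "Budget": [0.0, 1.0],
    }
    print(full_scan_top_k(store, [1.0, 0.0], k=2))
```

The work grows linearly with the number of stored vectors, whereas an HNSW index visits only a small part of the data per query.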
|
||||
|
||||
These patches:
|
||||
|
||||
1. Add `get_vector_indices()` to `graph_queries.py` returning CREATE
|
||||
VECTOR INDEX statements for FalkorDB on Entity.name_embedding,
|
||||
RELATES_TO.fact_embedding, and Community.name_embedding.
|
||||
2. Extend `falkordb_driver.py:build_indices_and_constraints()` to create
|
||||
the vector indexes alongside range and fulltext indexes.
|
||||
3. Rewrite the three vector-similarity call sites in
|
||||
`falkordb/operations/search_ops.py` to use
|
||||
`db.idx.vector.queryNodes` and `db.idx.vector.queryRelationships`
|
||||
instead of full-scan cosine math. Over-fetches by a configurable
|
||||
multiplier to handle filter rejections.
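
Items 1 and 3 can be sketched as follows. The real body of `get_vector_indices()` is not reproduced in this README, so this is only an illustration: it assumes FalkorDB's `CREATE VECTOR INDEX ... OPTIONS {dimension, similarityFunction}` syntax, a placeholder embedding dimension of 1024 (it must match the embedder actually in use), and hypothetical helper names `get_vector_indices_sketch` and `filter_after_overfetch`.

```python
from collections.abc import Callable


def get_vector_indices_sketch(dimension: int = 1024, similarity: str = "cosine") -> list[str]:
    """Return CREATE VECTOR INDEX statements for the three embedding properties.

    The dimension default is an assumption for illustration only.
    """
    options = f"OPTIONS {{dimension: {dimension}, similarityFunction: '{similarity}'}}"
    return [
        f"CREATE VECTOR INDEX FOR (n:Entity) ON (n.name_embedding) {options}",
        f"CREATE VECTOR INDEX FOR ()-[e:RELATES_TO]->() ON (e.fact_embedding) {options}",
        f"CREATE VECTOR INDEX FOR (n:Community) ON (n.name_embedding) {options}",
    ]


def filter_after_overfetch(
    ranked: list[tuple[str, str]],
    keep: Callable[[tuple[str, str]], bool],
    limit: int,
    multiplier: int,
) -> list[tuple[str, str]]:
    """Take limit * multiplier index candidates, apply the filter, return at most limit.

    `ranked` stands in for the index result, already ordered by score.
    """
    candidates = ranked[: limit * multiplier]
    return [item for item in candidates if keep(item)][:limit]


if __name__ == "__main__":
    for statement in get_vector_indices_sketch():
        print(statement)
```

Without over-fetching (`multiplier=1`), a filter such as a `group_id` match can leave fewer than `limit` results even though more matching items exist further down the ranking.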
|
||||
|
||||
## Files
|
||||
|
||||
| Patched file | Change |
|
||||
|---|---|
|
||||
| `graphiti_core/graph_queries.py` | Adds `get_vector_indices()` |
|
||||
| `graphiti_core/driver/falkordb_driver.py` | Extends `build_indices_and_constraints` |
|
||||
| `graphiti_core/driver/falkordb/operations/search_ops.py` | Three query rewrites |
|
||||
|
||||
## How to apply
|
||||
|
||||
`./apply.sh` — backs up the originals into `./backups/<timestamp>/`
|
||||
and copies the patched files over.
|
||||
|
||||
## How to revert
|
||||
|
||||
Move the timestamped backup back over the venv:
|
||||
|
||||
cp backups/<ts>/graphiti_core/graph_queries.py /home/aaron/aaronai/venv/lib/python3.12/site-packages/graphiti_core/graph_queries.py
|
||||
# ...etc
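
Because `apply.sh` stores backups under `backups/<timestamp>/graphiti_core/...`, mirroring the venv layout, the revert can also be done as a single tree copy. A minimal sketch, assuming that layout; `revert_from_backup` is a hypothetical helper, not part of the patch set.

```python
import shutil
from pathlib import Path


def revert_from_backup(backup_dir: Path, site_packages: Path) -> list[Path]:
    """Copy every file under backup_dir to the same relative path in site_packages."""
    restored: list[Path] = []
    for src in sorted(backup_dir.rglob("*")):
        if not src.is_file():
            continue
        dest = site_packages / src.relative_to(backup_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # overwrite the patched file with the original
        restored.append(dest)
    return restored
```

Call it with the timestamped backup directory and the venv's `site-packages` path, then restart the sidecar so the restored files are loaded.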
|
||||
|
||||
## Upstream candidate
|
||||
|
||||
This is a documented gap: issue #1263 references it indirectly via the
vector store overlay RFC. The maintainers' attention is on a Milvus/external
vector DB overlay; this patch is the FalkorDB-native alternative for users
who don't want a separate vector DB. Consider opening a PR once the patch
has been validated empirically in production.
|
||||
Executable
+77
@@ -0,0 +1,77 @@
|
||||
#!/usr/bin/env bash
|
||||
# apply.sh — Apply the BirdAI vendored graphiti-core patches.
|
||||
#
|
||||
# Backs up the original venv files into ./backups/<timestamp>/ before
|
||||
# overwriting. The backup directory layout mirrors the venv layout so a
|
||||
# revert is just a tree copy back.
|
||||
#
|
||||
# Usage: ./apply.sh
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
PATCH_DIR="$(cd "$(dirname "$0")" && pwd)"
|
||||
VENV_BASE="/home/aaron/aaronai/venv/lib/python3.12/site-packages"
|
||||
TIMESTAMP="$(date +%Y%m%d-%H%M%S)"
|
||||
BACKUP_DIR="$PATCH_DIR/backups/$TIMESTAMP"
|
||||
|
||||
# Files to patch — paths relative to graphiti_core/.
|
||||
FILES=(
|
||||
"graph_queries.py"
|
||||
"driver/falkordb_driver.py"
|
||||
"driver/falkordb/operations/search_ops.py"
|
||||
)
|
||||
|
||||
echo "graphiti-core vendored patch apply — BirdAI"
|
||||
echo "Patch directory: $PATCH_DIR"
|
||||
echo "Venv target: $VENV_BASE/graphiti_core/"
|
||||
echo "Backup to: $BACKUP_DIR"
|
||||
echo
|
||||
|
||||
# Pre-flight: confirm all source patch files exist.
|
||||
for rel in "${FILES[@]}"; do
|
||||
if [ ! -f "$PATCH_DIR/graphiti_core/$rel" ]; then
|
||||
echo "ERROR: missing patch file: $PATCH_DIR/graphiti_core/$rel" >&2
|
||||
exit 1
|
||||
fi
|
||||
done
|
||||
|
||||
# Pre-flight: confirm all target venv files exist.
|
||||
for rel in "${FILES[@]}"; do
|
||||
if [ ! -f "$VENV_BASE/graphiti_core/$rel" ]; then
|
||||
echo "ERROR: missing venv file: $VENV_BASE/graphiti_core/$rel" >&2
|
||||
echo " graphiti-core may not be installed, or version differs from 0.29.0." >&2
|
||||
exit 1
|
||||
fi
|
||||
done
|
||||
|
||||
# Backup originals.
|
||||
echo "[1/3] Backing up originals..."
|
||||
for rel in "${FILES[@]}"; do
|
||||
backup_path="$BACKUP_DIR/graphiti_core/$rel"
|
||||
mkdir -p "$(dirname "$backup_path")"
|
||||
cp "$VENV_BASE/graphiti_core/$rel" "$backup_path"
|
||||
echo " backed up: $rel"
|
||||
done
|
||||
echo
|
||||
|
||||
# Apply patches by copying.
|
||||
echo "[2/3] Applying patches..."
|
||||
for rel in "${FILES[@]}"; do
|
||||
cp "$PATCH_DIR/graphiti_core/$rel" "$VENV_BASE/graphiti_core/$rel"
|
||||
echo " patched: $rel"
|
||||
done
|
||||
echo
|
||||
|
||||
# Sanity check: confirm patched files have the marker.
|
||||
echo "[3/3] Verifying patched files..."
|
||||
for rel in "${FILES[@]}"; do
|
||||
if grep -q "PATCHED 2026-05-02" "$VENV_BASE/graphiti_core/$rel"; then
|
||||
echo " OK: $rel contains patch marker"
|
||||
else
|
||||
echo " WARNING: $rel missing patch marker (may be expected for graph_queries.py — its docstring uses the marker only in the module header)"
|
||||
fi
|
||||
done
|
||||
echo
|
||||
echo "Done. Backup: $BACKUP_DIR"
|
||||
echo "Restart the sidecar to pick up changes:"
|
||||
echo " sudo systemctl restart aaronai-graphiti.service"
|
||||
@@ -0,0 +1,904 @@
|
||||
"""
|
||||
Copyright 2024, Zep Software, Inc.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License");
|
||||
you may not use this file except in compliance with the License.
|
||||
You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software
|
||||
distributed under the License is distributed on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
See the License for the specific language governing permissions and
|
||||
limitations under the License.
|
||||
"""
|
||||
|
||||
import logging
|
||||
from typing import Any
|
||||
|
||||
from graphiti_core.driver.driver import GraphProvider
|
||||
from graphiti_core.driver.falkordb import STOPWORDS
|
||||
from graphiti_core.driver.operations.search_ops import SearchOperations
|
||||
from graphiti_core.driver.query_executor import QueryExecutor
|
||||
from graphiti_core.driver.record_parsers import (
|
||||
community_node_from_record,
|
||||
entity_edge_from_record,
|
||||
entity_node_from_record,
|
||||
episodic_node_from_record,
|
||||
)
|
||||
from graphiti_core.edges import EntityEdge
|
||||
from graphiti_core.graph_queries import (
|
||||
get_nodes_query,
|
||||
get_relationships_query,
|
||||
get_vector_cosine_func_query,
|
||||
)
|
||||
from graphiti_core.models.edges.edge_db_queries import get_entity_edge_return_query
|
||||
from graphiti_core.models.nodes.node_db_queries import (
|
||||
COMMUNITY_NODE_RETURN,
|
||||
EPISODIC_NODE_RETURN,
|
||||
get_entity_node_return_query,
|
||||
)
|
||||
from graphiti_core.nodes import CommunityNode, EntityNode, EpisodicNode
|
||||
from graphiti_core.search.search_filters import (
|
||||
SearchFilters,
|
||||
edge_search_filter_query_constructor,
|
||||
node_search_filter_query_constructor,
|
||||
)
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
MAX_QUERY_LENGTH = 128
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Vector index dispatcher (PATCHED 2026-05-02, BirdAI vendored patch).
|
||||
#
|
||||
# graphiti-core's FalkorDB driver historically composed similarity queries
|
||||
# using `vec.cosineDistance(...)` in interpreted Cypher, which produces a
|
||||
# full-table scan for every search. FalkorDB supports native vector indexes
|
||||
# via `db.idx.vector.queryNodes` and `db.idx.vector.queryRelationships`;
|
||||
# this dispatcher uses them when present and falls back to the cosine math
|
||||
# otherwise.
|
||||
#
|
||||
# Index existence is checked once per (label, attribute, entity_type) and
|
||||
# cached at module scope. The cache should be invalidated whenever
|
||||
# `build_indices_and_constraints` runs (since indexes may have been created
|
||||
# or dropped). FalkorDriver.build_indices_and_constraints is patched to
|
||||
# call `_invalidate_falkordb_vector_index_cache()` after building.
|
||||
#
|
||||
# Over-fetch factor (VECTOR_INDEX_CANDIDATE_MULTIPLIER from graph_queries)
|
||||
# preserves recall when WHERE filters reject some of the top-k candidates.
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
# get_vector_cosine_func_query is already imported at the top of the module.
from graphiti_core.graph_queries import VECTOR_INDEX_CANDIDATE_MULTIPLIER
|
||||
|
||||
# Cache: key = (label, attribute, entity_type), value = bool
|
||||
# entity_type is 'NODE' or 'RELATIONSHIP'.
|
||||
_FALKORDB_VECTOR_INDEX_CACHE: dict[tuple[str, str, str], bool] = {}
|
||||
|
||||
|
||||
def _invalidate_falkordb_vector_index_cache() -> None:
    """Clear the vector-index existence cache. Call after build_indices_and_constraints."""
    _FALKORDB_VECTOR_INDEX_CACHE.clear()
|
||||
|
||||
|
||||
async def _falkordb_vector_index_exists(
    executor: QueryExecutor,
    label: str,
    attribute: str,
    entity_type: str,
) -> bool:
    """Check whether a FalkorDB vector index exists for the given target.

    entity_type is 'NODE' for node-label indexes, 'RELATIONSHIP' for edge-type indexes.
    Result is cached at module scope; call _invalidate_falkordb_vector_index_cache()
    after building or dropping indexes.
    """
    key = (label, attribute, entity_type)
    if key in _FALKORDB_VECTOR_INDEX_CACHE:
        return _FALKORDB_VECTOR_INDEX_CACHE[key]

    try:
        records, _, _ = await executor.execute_query(
            "CALL db.indexes() YIELD label, properties, types, entitytype "
            "RETURN label, properties, types, entitytype"
        )
    except Exception as e:
        # If we cannot enumerate indexes, fall back to "no index" rather than
        # propagating the error. The fallback cosine-math path is correct,
        # just slower.
        logger.warning(f"FalkorDB vector index probe failed; assuming none exist: {e}")
        _FALKORDB_VECTOR_INDEX_CACHE[key] = False
        return False

    found = False
    for r in records:
        # Records come back as dict-like rows keyed by column name (not
        # tuples). Access by string keys matching the YIELD clause above.
        rec_label = r.get('label') if hasattr(r, 'get') else r['label']
        rec_props = r.get('properties') if hasattr(r, 'get') else r['properties']
        rec_types = r.get('types') if hasattr(r, 'get') else r['types']
        rec_entitytype = r.get('entitytype') if hasattr(r, 'get') else r['entitytype']
        if rec_props is None:
            rec_props = []
        if rec_types is None:
            rec_types = {}

        if rec_label != label:
            continue
        if rec_entitytype is not None and rec_entitytype != entity_type:
            continue
        if attribute not in rec_props:
            continue

        # rec_types is a dict like {attribute: ['VECTOR', ...], ...} or sometimes
        # a flat list; handle both shapes.
        if isinstance(rec_types, dict):
            attr_types = rec_types.get(attribute, [])
        else:
            attr_types = rec_types
        if 'VECTOR' in attr_types:
            found = True
            break

    _FALKORDB_VECTOR_INDEX_CACHE[key] = found
    return found
|
||||
|
||||
|
||||
def _falkordb_vector_node_search_cypher(
    label: str,
    embedding_attr: str,
    search_vector_param: str,
    use_index: bool,
) -> tuple[str, str]:
    """Build the cypher prefix and node-binding for a node-vector search.

    Returns (prefix, node_var) where:
    - prefix is the Cypher fragment that binds the node variable and a
      `score` variable. With index, it's a CALL ... YIELD; without, it's
      a MATCH plus WITH cosine math.
    - node_var is the variable name the caller's downstream Cypher should
      reference (always 'n' here for parity with the existing code).

    The caller appends WHERE filters and RETURN/ORDER BY/LIMIT as usual.
    The over-fetch parameter `$candidate_k` must be passed by the caller
    when use_index is True.
    """
    if use_index:
        return (
            f"CALL db.idx.vector.queryNodes("
            f"'{label}', '{embedding_attr}', $candidate_k, vecf32({search_vector_param})"
            f") YIELD node, score "
            f"WITH node AS n, score "
        ), "n"
    # Fallback: original cosine math path
    cosine = get_vector_cosine_func_query(
        f"n.{embedding_attr}", search_vector_param, GraphProvider.FALKORDB
    )
    return (
        f"MATCH (n:{label}) "
        f"WITH n, {cosine} AS score "
    ), "n"
|
||||
|
||||
|
||||
def _falkordb_vector_edge_search_cypher(
    relationship_type: str,
    embedding_attr: str,
    search_vector_param: str,
    use_index: bool,
) -> tuple[str, str]:
    """Build the cypher prefix and edge-binding for an edge-vector search.

    Returns (prefix, edge_var). With the index, the procedure binds the
    relationship variable; we then MATCH source and target via the existing
    edge to recover (n)-[e]->(m). Without the index, it's the original
    MATCH-and-cosine path.

    Variable name is 'e' for parity with existing code; source/target are
    'n' and 'm' respectively, also for parity.
    """
    if use_index:
        return (
            f"CALL db.idx.vector.queryRelationships("
            f"'{relationship_type}', '{embedding_attr}', $candidate_k, vecf32({search_vector_param})"
            f") YIELD relationship, score "
            f"MATCH (n:Entity)-[e:{relationship_type}]->(m:Entity) "
            f"WHERE e = relationship "
            f"WITH DISTINCT e, n, m, score "
        ), "e"
    # Fallback
    cosine = get_vector_cosine_func_query(
        f"e.{embedding_attr}", search_vector_param, GraphProvider.FALKORDB
    )
    return (
        f"MATCH (n:Entity)-[e:{relationship_type}]->(m:Entity) "
        f"WITH DISTINCT e, n, m, {cosine} AS score "
    ), "e"
|
||||
|
||||
|
||||
|
||||
# FalkorDB separator characters that break text into tokens
|
||||
_SEPARATOR_MAP = str.maketrans(
|
||||
{
|
||||
',': ' ',
|
||||
'.': ' ',
|
||||
'<': ' ',
|
||||
'>': ' ',
|
||||
'{': ' ',
|
||||
'}': ' ',
|
||||
'[': ' ',
|
||||
']': ' ',
|
||||
'"': ' ',
|
||||
"'": ' ',
|
||||
':': ' ',
|
||||
';': ' ',
|
||||
'!': ' ',
|
||||
'@': ' ',
|
||||
'#': ' ',
|
||||
'$': ' ',
|
||||
'%': ' ',
|
||||
'^': ' ',
|
||||
'&': ' ',
|
||||
'*': ' ',
|
||||
'(': ' ',
|
||||
')': ' ',
|
||||
'-': ' ',
|
||||
'+': ' ',
|
||||
'=': ' ',
|
||||
'~': ' ',
|
||||
'?': ' ',
|
||||
'|': ' ',
|
||||
'/': ' ',
|
||||
'\\': ' ',
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
def _sanitize(query: str) -> str:
    """Replace FalkorDB special characters with whitespace."""
    sanitized = query.translate(_SEPARATOR_MAP)
    return ' '.join(sanitized.split())


def _build_falkor_fulltext_query(
    query: str,
    group_ids: list[str] | None = None,
    max_query_length: int = MAX_QUERY_LENGTH,
) -> str:
    """Build a fulltext query string for FalkorDB using RedisSearch syntax."""
    if group_ids is None or len(group_ids) == 0:
        group_filter = ''
    else:
        escaped_group_ids = [f'"{gid}"' for gid in group_ids]
        group_values = '|'.join(escaped_group_ids)
        group_filter = f'(@group_id:{group_values})'

    sanitized_query = _sanitize(query)

    # Remove stopwords and empty tokens
    query_words = sanitized_query.split()
    filtered_words = [word for word in query_words if word and word.lower() not in STOPWORDS]
    sanitized_query = ' | '.join(filtered_words)

    if len(sanitized_query.split(' ')) + len(group_ids or '') >= max_query_length:
        return ''

    full_query = group_filter + ' (' + sanitized_query + ')'
    return full_query
|
||||
|
||||
|
||||
class FalkorSearchOperations(SearchOperations):
    # --- Node search ---

    async def node_fulltext_search(
        self,
        executor: QueryExecutor,
        query: str,
        search_filter: SearchFilters,
        group_ids: list[str] | None = None,
        limit: int = 10,
    ) -> list[EntityNode]:
        fuzzy_query = _build_falkor_fulltext_query(query, group_ids)
        if fuzzy_query == '':
            return []

        filter_queries, filter_params = node_search_filter_query_constructor(
            search_filter, GraphProvider.FALKORDB
        )

        if group_ids is not None:
            filter_queries.append('n.group_id IN $group_ids')
            filter_params['group_ids'] = group_ids

        filter_query = ''
        if filter_queries:
            filter_query = ' WHERE ' + (' AND '.join(filter_queries))

        cypher = (
            get_nodes_query(
                'node_name_and_summary', '$query', limit=limit, provider=GraphProvider.FALKORDB
            )
            + 'YIELD node AS n, score'
            + filter_query
            + """
            WITH n, score
            ORDER BY score DESC
            LIMIT $limit
            RETURN
            """
            + get_entity_node_return_query(GraphProvider.FALKORDB)
        )

        records, _, _ = await executor.execute_query(
            cypher,
            query=fuzzy_query,
            limit=limit,
            **filter_params,
        )

        return [entity_node_from_record(r) for r in records]
|
||||
|
||||
    async def node_similarity_search(
        self,
        executor: QueryExecutor,
        search_vector: list[float],
        search_filter: SearchFilters,
        group_ids: list[str] | None = None,
        limit: int = 10,
        min_score: float = 0.6,
    ) -> list[EntityNode]:
        filter_queries, filter_params = node_search_filter_query_constructor(
            search_filter, GraphProvider.FALKORDB
        )

        if group_ids is not None:
            filter_queries.append('n.group_id IN $group_ids')
            filter_params['group_ids'] = group_ids

        filter_query = ''
        if filter_queries:
            filter_query = ' WHERE ' + (' AND '.join(filter_queries))

        # PATCHED 2026-05-02 (BirdAI vendored patch): use FalkorDB native vector
        # index when available; fall back to interpreted-Cypher cosine math
        # otherwise. The filter clause's position changes between paths
        # (after MATCH for fallback, after YIELD for index path), but the
        # filter expressions themselves are identical because they reference
        # the bound variable `n` either way.
        use_index = await _falkordb_vector_index_exists(
            executor, 'Entity', 'name_embedding', 'NODE'
        )
        prefix, _ = _falkordb_vector_node_search_cypher(
            'Entity', 'name_embedding', '$search_vector', use_index
        )
        where_clauses = []
        if filter_query:
            where_clauses.append(filter_query.replace(' WHERE ', '', 1).strip())
        where_clauses.append('score > $min_score')
        unified_where = ' WHERE ' + ' AND '.join(where_clauses)

        cypher = (
            prefix
            + unified_where
            + """
            RETURN
            """
            + get_entity_node_return_query(GraphProvider.FALKORDB)
            + """
            ORDER BY score DESC
            LIMIT $limit
            """
        )
        params = dict(
            search_vector=search_vector,
            limit=limit,
            min_score=min_score,
            **filter_params,
        )
        if use_index:
            params['candidate_k'] = limit * VECTOR_INDEX_CANDIDATE_MULTIPLIER
        records, _, _ = await executor.execute_query(cypher, **params)

        return [entity_node_from_record(r) for r in records]
|
||||
|
||||
async def node_bfs_search(
|
||||
self,
|
||||
executor: QueryExecutor,
|
||||
origin_uuids: list[str],
|
||||
search_filter: SearchFilters,
|
||||
max_depth: int,
|
||||
group_ids: list[str] | None = None,
|
||||
limit: int = 10,
|
||||
) -> list[EntityNode]:
|
||||
if not origin_uuids or max_depth < 1:
|
||||
return []
|
||||
|
||||
filter_queries, filter_params = node_search_filter_query_constructor(
|
||||
search_filter, GraphProvider.FALKORDB
|
||||
)
|
||||
|
||||
if group_ids is not None:
|
||||
filter_queries.append('n.group_id IN $group_ids')
|
||||
filter_queries.append('origin.group_id IN $group_ids')
|
||||
filter_params['group_ids'] = group_ids
|
||||
|
||||
filter_query = ''
|
||||
if filter_queries:
|
||||
filter_query = ' AND ' + (' AND '.join(filter_queries))
|
||||
|
||||
cypher = (
|
||||
f"""
|
||||
UNWIND $bfs_origin_node_uuids AS origin_uuid
|
||||
MATCH (origin {{uuid: origin_uuid}})-[:RELATES_TO|MENTIONS*1..{max_depth}]->(n:Entity)
|
||||
WHERE n.group_id = origin.group_id
|
||||
"""
|
||||
+ filter_query
|
||||
+ """
|
||||
RETURN
|
||||
"""
|
||||
+ get_entity_node_return_query(GraphProvider.FALKORDB)
|
||||
+ """
|
||||
LIMIT $limit
|
||||
"""
|
||||
)
|
||||
|
||||
records, _, _ = await executor.execute_query(
|
||||
cypher,
|
||||
bfs_origin_node_uuids=origin_uuids,
|
||||
limit=limit,
|
||||
**filter_params,
|
||||
)
|
||||
|
||||
return [entity_node_from_record(r) for r in records]
|
||||
|
||||
# --- Edge search ---
|
||||
|
||||
async def edge_fulltext_search(
|
||||
self,
|
||||
executor: QueryExecutor,
|
||||
query: str,
|
||||
search_filter: SearchFilters,
|
||||
group_ids: list[str] | None = None,
|
||||
limit: int = 10,
|
||||
) -> list[EntityEdge]:
|
||||
fuzzy_query = _build_falkor_fulltext_query(query, group_ids)
|
||||
if fuzzy_query == '':
|
||||
return []
|
||||
|
||||
filter_queries, filter_params = edge_search_filter_query_constructor(
|
||||
search_filter, GraphProvider.FALKORDB
|
||||
)
|
||||
|
||||
if group_ids is not None:
|
||||
filter_queries.append('e.group_id IN $group_ids')
|
||||
filter_params['group_ids'] = group_ids
|
||||
|
||||
filter_query = ''
|
||||
if filter_queries:
|
||||
filter_query = ' WHERE ' + (' AND '.join(filter_queries))
|
||||
|
||||
cypher = (
|
||||
get_relationships_query(
|
||||
'edge_name_and_fact', limit=limit, provider=GraphProvider.FALKORDB
|
||||
)
|
||||
+ """
|
||||
YIELD relationship AS rel, score
|
||||
MATCH (n:Entity)-[e:RELATES_TO {uuid: rel.uuid}]->(m:Entity)
|
||||
"""
|
||||
+ filter_query
|
||||
+ """
|
||||
WITH e, score, n, m
|
||||
RETURN
|
||||
"""
|
||||
+ get_entity_edge_return_query(GraphProvider.FALKORDB)
|
||||
+ """
|
||||
ORDER BY score DESC
|
||||
LIMIT $limit
|
||||
"""
|
||||
)
|
||||
|
||||
records, _, _ = await executor.execute_query(
|
||||
cypher,
|
||||
query=fuzzy_query,
|
||||
limit=limit,
|
||||
**filter_params,
|
||||
)
|
||||
|
||||
return [entity_edge_from_record(r) for r in records]
|
||||
|
||||
async def edge_similarity_search(
|
||||
self,
|
||||
executor: QueryExecutor,
|
||||
search_vector: list[float],
|
||||
source_node_uuid: str | None,
|
||||
target_node_uuid: str | None,
|
||||
search_filter: SearchFilters,
|
||||
group_ids: list[str] | None = None,
|
||||
limit: int = 10,
|
||||
min_score: float = 0.6,
|
||||
) -> list[EntityEdge]:
|
||||
filter_queries, filter_params = edge_search_filter_query_constructor(
|
||||
search_filter, GraphProvider.FALKORDB
|
||||
)
|
||||
|
||||
if group_ids is not None:
|
||||
filter_queries.append('e.group_id IN $group_ids')
|
||||
filter_params['group_ids'] = group_ids
|
||||
|
||||
if source_node_uuid is not None:
|
||||
filter_params['source_uuid'] = source_node_uuid
|
||||
filter_queries.append('n.uuid = $source_uuid')
|
||||
|
||||
if target_node_uuid is not None:
|
||||
filter_params['target_uuid'] = target_node_uuid
|
||||
filter_queries.append('m.uuid = $target_uuid')
|
||||
|
||||
filter_query = ''
|
||||
if filter_queries:
|
||||
filter_query = ' WHERE ' + (' AND '.join(filter_queries))
|
||||
|
||||
# PATCHED 2026-05-02 (BirdAI vendored patch): use FalkorDB native vector
|
||||
# index on RELATES_TO.fact_embedding when available. The unindexed
|
||||
# fallback is the same MATCH-and-cosine math that previously hung
|
||||
# for 6+ minutes on a 4,000-entity graph; this is the load-bearing
|
||||
# call site that motivated the patch.
|
||||
use_index = await _falkordb_vector_index_exists(
|
||||
executor, 'RELATES_TO', 'fact_embedding', 'RELATIONSHIP'
|
||||
)
|
||||
prefix, _ = _falkordb_vector_edge_search_cypher(
|
||||
'RELATES_TO', 'fact_embedding', '$search_vector', use_index
|
||||
)
|
||||
where_clauses = []
|
||||
if filter_query:
|
||||
where_clauses.append(filter_query.replace(' WHERE ', '', 1).strip())
|
||||
where_clauses.append('score > $min_score')
|
||||
unified_where = ' WHERE ' + ' AND '.join(where_clauses)
|
||||
|
||||
cypher = (
|
||||
prefix
|
||||
+ unified_where
|
||||
+ """
|
||||
RETURN
|
||||
"""
|
||||
+ get_entity_edge_return_query(GraphProvider.FALKORDB)
|
||||
+ """
|
||||
ORDER BY score DESC
|
||||
LIMIT $limit
|
||||
"""
|
||||
)
|
||||
params = dict(
|
||||
search_vector=search_vector,
|
||||
limit=limit,
|
||||
min_score=min_score,
|
||||
**filter_params,
|
||||
)
|
||||
if use_index:
|
||||
params['candidate_k'] = limit * VECTOR_INDEX_CANDIDATE_MULTIPLIER
|
||||
records, _, _ = await executor.execute_query(cypher, **params)
|
||||
|
||||
return [entity_edge_from_record(r) for r in records]
|
||||
|
||||
async def edge_bfs_search(
|
||||
self,
|
||||
executor: QueryExecutor,
|
||||
origin_uuids: list[str],
|
||||
max_depth: int,
|
||||
search_filter: SearchFilters,
|
||||
group_ids: list[str] | None = None,
|
||||
limit: int = 10,
|
||||
) -> list[EntityEdge]:
|
||||
if not origin_uuids:
|
||||
return []
|
||||
|
||||
filter_queries, filter_params = edge_search_filter_query_constructor(
|
||||
search_filter, GraphProvider.FALKORDB
|
||||
)
|
||||
|
||||
if group_ids is not None:
|
||||
filter_queries.append('e.group_id IN $group_ids')
|
||||
filter_params['group_ids'] = group_ids
|
||||
|
||||
filter_query = ''
|
||||
if filter_queries:
|
||||
filter_query = ' WHERE ' + (' AND '.join(filter_queries))
|
||||
|
||||
cypher = (
|
||||
f"""
|
||||
UNWIND $bfs_origin_node_uuids AS origin_uuid
|
||||
MATCH path = (origin {{uuid: origin_uuid}})-[:RELATES_TO|MENTIONS*1..{max_depth}]->(:Entity)
|
||||
UNWIND relationships(path) AS rel
|
||||
MATCH (n:Entity)-[e:RELATES_TO {{uuid: rel.uuid}}]-(m:Entity)
|
||||
"""
|
||||
+ filter_query
|
||||
+ """
|
||||
RETURN DISTINCT
|
||||
"""
|
||||
+ get_entity_edge_return_query(GraphProvider.FALKORDB)
|
||||
+ """
|
||||
LIMIT $limit
|
||||
"""
|
||||
)
|
||||
|
||||
records, _, _ = await executor.execute_query(
|
||||
cypher,
|
||||
bfs_origin_node_uuids=origin_uuids,
|
||||
depth=max_depth,
|
||||
limit=limit,
|
||||
**filter_params,
|
||||
)
|
||||
|
||||
return [entity_edge_from_record(r) for r in records]
|
||||
|
||||
# --- Episode search ---
|
||||
|
||||
async def episode_fulltext_search(
|
||||
self,
|
||||
executor: QueryExecutor,
|
||||
query: str,
|
||||
search_filter: SearchFilters, # noqa: ARG002
|
||||
group_ids: list[str] | None = None,
|
||||
limit: int = 10,
|
||||
) -> list[EpisodicNode]:
|
||||
fuzzy_query = _build_falkor_fulltext_query(query, group_ids)
|
||||
if fuzzy_query == '':
|
||||
return []
|
||||
|
||||
filter_params: dict[str, Any] = {}
|
||||
group_filter_query = ''
|
||||
if group_ids is not None:
|
||||
group_filter_query += '\nAND e.group_id IN $group_ids'
|
||||
filter_params['group_ids'] = group_ids
|
||||
|
||||
cypher = (
|
||||
get_nodes_query(
|
||||
'episode_content', '$query', limit=limit, provider=GraphProvider.FALKORDB
|
||||
)
|
||||
+ """
|
||||
YIELD node AS episode, score
|
||||
MATCH (e:Episodic)
|
||||
WHERE e.uuid = episode.uuid
|
||||
"""
|
||||
+ group_filter_query
|
||||
+ """
|
||||
RETURN
|
||||
"""
|
||||
+ EPISODIC_NODE_RETURN
|
||||
+ """
|
||||
ORDER BY score DESC
|
||||
LIMIT $limit
|
||||
"""
|
||||
)
|
||||
|
||||
records, _, _ = await executor.execute_query(
|
||||
cypher, query=fuzzy_query, limit=limit, **filter_params
|
||||
)
|
||||
|
||||
return [episodic_node_from_record(r) for r in records]
|
||||
|
||||
# --- Community search ---
|
||||
|
||||
async def community_fulltext_search(
|
||||
self,
|
||||
executor: QueryExecutor,
|
||||
query: str,
|
||||
group_ids: list[str] | None = None,
|
||||
limit: int = 10,
|
||||
) -> list[CommunityNode]:
|
||||
fuzzy_query = _build_falkor_fulltext_query(query, group_ids)
|
||||
if fuzzy_query == '':
|
||||
return []
|
||||
|
||||
filter_params: dict[str, Any] = {}
|
||||
group_filter_query = ''
|
||||
if group_ids is not None:
|
||||
group_filter_query = 'WHERE c.group_id IN $group_ids'
|
||||
filter_params['group_ids'] = group_ids
|
||||
|
||||
cypher = (
|
||||
get_nodes_query(
|
||||
'community_name', '$query', limit=limit, provider=GraphProvider.FALKORDB
|
||||
)
|
||||
+ """
|
||||
YIELD node AS c, score
|
||||
WITH c, score
|
||||
"""
|
||||
+ group_filter_query
|
||||
+ """
|
||||
RETURN
|
||||
"""
|
||||
+ COMMUNITY_NODE_RETURN
|
||||
+ """
|
||||
ORDER BY score DESC
|
||||
LIMIT $limit
|
||||
"""
|
||||
)
|
||||
|
||||
records, _, _ = await executor.execute_query(
|
||||
cypher, query=fuzzy_query, limit=limit, **filter_params
|
||||
)
|
||||
|
||||
return [community_node_from_record(r) for r in records]
|
||||
|
||||
async def community_similarity_search(
|
||||
self,
|
||||
executor: QueryExecutor,
|
||||
search_vector: list[float],
|
||||
group_ids: list[str] | None = None,
|
||||
limit: int = 10,
|
||||
min_score: float = 0.6,
|
||||
) -> list[CommunityNode]:
|
||||
query_params: dict[str, Any] = {}
|
||||
|
||||
group_filter_query = ''
|
||||
if group_ids is not None:
|
||||
group_filter_query += ' WHERE c.group_id IN $group_ids'
|
||||
query_params['group_ids'] = group_ids
|
||||
|
||||
# PATCHED 2026-05-02 (BirdAI vendored patch): use FalkorDB native vector
|
||||
# index on Community.name_embedding when available. Note: the existing
|
||||
# filter is built into `group_filter_query` (already prefixed with
|
||||
# ' WHERE ' if non-empty) and uses variable `c`. The dispatcher binds
|
||||
# the node as `n` for parity with the helper signature, then we
|
||||
# re-bind to `c` via WITH so the rest of the query is unchanged.
|
||||
use_index = await _falkordb_vector_index_exists(
|
||||
executor, 'Community', 'name_embedding', 'NODE'
|
||||
)
|
||||
prefix, _ = _falkordb_vector_node_search_cypher(
|
||||
'Community', 'name_embedding', '$search_vector', use_index
|
||||
)
|
||||
prefix = prefix + ' WITH n AS c, score '
|
||||
where_clauses = []
|
||||
if group_filter_query:
|
||||
where_clauses.append(group_filter_query.replace(' WHERE ', '', 1).strip())
|
||||
where_clauses.append('score > $min_score')
|
||||
unified_where = ' WHERE ' + ' AND '.join(where_clauses)
|
||||
|
||||
cypher = (
|
||||
prefix
|
||||
+ unified_where
|
||||
+ """
|
||||
RETURN
|
||||
"""
|
||||
+ COMMUNITY_NODE_RETURN
|
||||
+ """
|
||||
ORDER BY score DESC
|
||||
LIMIT $limit
|
||||
"""
|
||||
)
|
||||
params = dict(
|
||||
search_vector=search_vector,
|
||||
limit=limit,
|
||||
min_score=min_score,
|
||||
**query_params,
|
||||
)
|
||||
if use_index:
|
||||
params['candidate_k'] = limit * VECTOR_INDEX_CANDIDATE_MULTIPLIER
|
||||
records, _, _ = await executor.execute_query(cypher, **params)
|
||||
|
||||
return [community_node_from_record(r) for r in records]
|
||||
|
||||
# --- Rerankers ---
|
||||
|
||||
async def node_distance_reranker(
|
||||
self,
|
||||
executor: QueryExecutor,
|
||||
node_uuids: list[str],
|
||||
center_node_uuid: str,
|
||||
min_score: float = 0,
|
||||
) -> list[EntityNode]:
|
||||
filtered_uuids = [u for u in node_uuids if u != center_node_uuid]
|
||||
scores: dict[str, float] = {center_node_uuid: 0.0}
|
||||
|
||||
cypher = """
|
||||
UNWIND $node_uuids AS node_uuid
|
||||
MATCH (center:Entity {uuid: $center_uuid})-[:RELATES_TO]-(n:Entity {uuid: node_uuid})
|
||||
RETURN 1 AS score, node_uuid AS uuid
|
||||
"""
|
||||
|
||||
results, _, _ = await executor.execute_query(
|
||||
cypher,
|
||||
node_uuids=filtered_uuids,
|
||||
center_uuid=center_node_uuid,
|
||||
)
|
||||
|
||||
for result in results:
|
||||
scores[result['uuid']] = result['score']
|
||||
|
||||
for uuid in filtered_uuids:
|
||||
if uuid not in scores:
|
||||
scores[uuid] = float('inf')
|
||||
|
||||
filtered_uuids.sort(key=lambda cur_uuid: scores[cur_uuid])
|
||||
|
||||
if center_node_uuid in node_uuids:
|
||||
scores[center_node_uuid] = 0.1
|
||||
filtered_uuids = [center_node_uuid] + filtered_uuids
|
||||
|
||||
reranked_uuids = [u for u in filtered_uuids if (1 / scores[u]) >= min_score]
|
||||
|
||||
if not reranked_uuids:
|
||||
return []
|
||||
|
||||
get_query = """
|
||||
MATCH (n:Entity)
|
||||
WHERE n.uuid IN $uuids
|
||||
RETURN
|
||||
""" + get_entity_node_return_query(GraphProvider.FALKORDB)
|
||||
|
||||
records, _, _ = await executor.execute_query(get_query, uuids=reranked_uuids)
|
||||
|
||||
node_map = {r['uuid']: entity_node_from_record(r) for r in records}
|
||||
return [node_map[u] for u in reranked_uuids if u in node_map]
|
||||
|
||||
async def episode_mentions_reranker(
|
||||
self,
|
||||
executor: QueryExecutor,
|
||||
node_uuids: list[str],
|
||||
min_score: float = 0,
|
||||
) -> list[EntityNode]:
|
||||
if not node_uuids:
|
||||
return []
|
||||
|
||||
scores: dict[str, float] = {}
|
||||
|
||||
results, _, _ = await executor.execute_query(
|
||||
"""
|
||||
UNWIND $node_uuids AS node_uuid
|
||||
MATCH (episode:Episodic)-[r:MENTIONS]->(n:Entity {uuid: node_uuid})
|
||||
RETURN count(*) AS score, n.uuid AS uuid
|
||||
""",
|
||||
node_uuids=node_uuids,
|
||||
)
|
||||
|
||||
for result in results:
|
||||
scores[result['uuid']] = result['score']
|
||||
|
||||
for uuid in node_uuids:
|
||||
if uuid not in scores:
|
||||
scores[uuid] = float('inf')
|
||||
|
||||
sorted_uuids = list(node_uuids)
|
||||
sorted_uuids.sort(key=lambda cur_uuid: scores[cur_uuid])
|
||||
|
||||
reranked_uuids = [u for u in sorted_uuids if scores[u] >= min_score]
|
||||
|
||||
if not reranked_uuids:
|
||||
return []
|
||||
|
||||
get_query = """
|
||||
MATCH (n:Entity)
|
||||
WHERE n.uuid IN $uuids
|
||||
RETURN
|
||||
""" + get_entity_node_return_query(GraphProvider.FALKORDB)
|
||||
|
||||
records, _, _ = await executor.execute_query(get_query, uuids=reranked_uuids)
|
||||
|
||||
node_map = {r['uuid']: entity_node_from_record(r) for r in records}
|
||||
return [node_map[u] for u in reranked_uuids if u in node_map]
|
||||
|
||||
# --- Filter builders ---
|
||||
|
||||
def build_node_search_filters(self, search_filters: SearchFilters) -> Any:
|
||||
filter_queries, filter_params = node_search_filter_query_constructor(
|
||||
search_filters, GraphProvider.FALKORDB
|
||||
)
|
||||
return {'filter_queries': filter_queries, 'filter_params': filter_params}
|
||||
|
||||
def build_edge_search_filters(self, search_filters: SearchFilters) -> Any:
|
||||
filter_queries, filter_params = edge_search_filter_query_constructor(
|
||||
search_filters, GraphProvider.FALKORDB
|
||||
)
|
||||
return {'filter_queries': filter_queries, 'filter_params': filter_params}
|
||||
|
||||
# --- Fulltext query builder ---
|
||||
|
||||
def build_fulltext_query(
|
||||
self,
|
||||
query: str,
|
||||
group_ids: list[str] | None = None,
|
||||
max_query_length: int = MAX_QUERY_LENGTH,
|
||||
) -> str:
|
||||
return _build_falkor_fulltext_query(query, group_ids, max_query_length)
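The patched similarity-search methods above share one pattern: optional filter expressions and the mandatory `score > $min_score` threshold are joined into a single `WHERE` clause, and `candidate_k` is sent only when the vector-index path is taken. The following is a minimal standalone sketch of that assembly, not the patch's own code. `build_unified_where`, `build_params`, and the multiplier value of 4 are hypothetical stand-ins chosen for illustration; the real `VECTOR_INDEX_CANDIDATE_MULTIPLIER` is defined elsewhere and its value is not shown here.

```python
from typing import Any

# Hypothetical value for illustration; the patch defines its own
# VECTOR_INDEX_CANDIDATE_MULTIPLIER elsewhere.
CANDIDATE_MULTIPLIER = 4


def build_unified_where(filter_queries: list[str]) -> str:
    """Join optional filters with the score threshold into one WHERE clause."""
    clauses = [*filter_queries, 'score > $min_score']
    return ' WHERE ' + ' AND '.join(clauses)


def build_params(
    search_vector: list[float],
    limit: int,
    min_score: float,
    use_index: bool,
    **filter_params: Any,
) -> dict[str, Any]:
    """Collect query parameters; add candidate_k only on the index path."""
    params: dict[str, Any] = dict(
        search_vector=search_vector, limit=limit, min_score=min_score, **filter_params
    )
    if use_index:
        params['candidate_k'] = limit * CANDIDATE_MULTIPLIER
    return params


where = build_unified_where(['n.group_id IN $group_ids'])
params = build_params([0.1, 0.2], limit=10, min_score=0.6, use_index=True, group_ids=['g1'])
```

Joining the list of filter expressions directly, as sketched here, avoids stripping and re-adding the `' WHERE '` prefix. Whether that refactor suits the vendored patch is a judgment call for its maintainers.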
|
||||
@@ -0,0 +1,444 @@
|
||||
"""
|
||||
Copyright 2024, Zep Software, Inc.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License");
|
||||
you may not use this file except in compliance with the License.
|
||||
You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software
|
||||
distributed under the License is distributed on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
See the License for the specific language governing permissions and
|
||||
limitations under the License.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import datetime
|
||||
import logging
|
||||
from typing import TYPE_CHECKING, Any
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from falkordb import Graph as FalkorGraph
|
||||
from falkordb.asyncio import FalkorDB
|
||||
else:
|
||||
try:
|
||||
from falkordb import Graph as FalkorGraph
|
||||
from falkordb.asyncio import FalkorDB
|
||||
except ImportError:
|
||||
# If falkordb is not installed, raise an ImportError
|
||||
raise ImportError(
|
||||
'falkordb is required for FalkorDriver. '
|
||||
'Install it with: pip install graphiti-core[falkordb]'
|
||||
) from None
|
||||
|
||||
from graphiti_core.driver.driver import GraphDriver, GraphDriverSession, GraphProvider
|
||||
from graphiti_core.driver.falkordb import STOPWORDS as STOPWORDS
|
||||
from graphiti_core.driver.falkordb.operations.community_edge_ops import (
|
||||
FalkorCommunityEdgeOperations,
|
||||
)
|
||||
from graphiti_core.driver.falkordb.operations.community_node_ops import (
|
||||
FalkorCommunityNodeOperations,
|
||||
)
|
||||
from graphiti_core.driver.falkordb.operations.entity_edge_ops import FalkorEntityEdgeOperations
|
||||
from graphiti_core.driver.falkordb.operations.entity_node_ops import FalkorEntityNodeOperations
|
||||
from graphiti_core.driver.falkordb.operations.episode_node_ops import FalkorEpisodeNodeOperations
|
||||
from graphiti_core.driver.falkordb.operations.episodic_edge_ops import FalkorEpisodicEdgeOperations
|
||||
from graphiti_core.driver.falkordb.operations.graph_ops import FalkorGraphMaintenanceOperations
|
||||
from graphiti_core.driver.falkordb.operations.has_episode_edge_ops import (
|
||||
FalkorHasEpisodeEdgeOperations,
|
||||
)
|
||||
from graphiti_core.driver.falkordb.operations.next_episode_edge_ops import (
|
||||
FalkorNextEpisodeEdgeOperations,
|
||||
)
|
||||
from graphiti_core.driver.falkordb.operations.saga_node_ops import FalkorSagaNodeOperations
|
||||
from graphiti_core.driver.falkordb.operations.search_ops import FalkorSearchOperations
|
||||
from graphiti_core.driver.operations.community_edge_ops import CommunityEdgeOperations
|
||||
from graphiti_core.driver.operations.community_node_ops import CommunityNodeOperations
|
||||
from graphiti_core.driver.operations.entity_edge_ops import EntityEdgeOperations
|
||||
from graphiti_core.driver.operations.entity_node_ops import EntityNodeOperations
|
||||
from graphiti_core.driver.operations.episode_node_ops import EpisodeNodeOperations
|
||||
from graphiti_core.driver.operations.episodic_edge_ops import EpisodicEdgeOperations
|
||||
from graphiti_core.driver.operations.graph_ops import GraphMaintenanceOperations
|
||||
from graphiti_core.driver.operations.has_episode_edge_ops import HasEpisodeEdgeOperations
|
||||
from graphiti_core.driver.operations.next_episode_edge_ops import NextEpisodeEdgeOperations
|
||||
from graphiti_core.driver.operations.saga_node_ops import SagaNodeOperations
|
||||
from graphiti_core.driver.operations.search_ops import SearchOperations
|
||||
from graphiti_core.graph_queries import get_fulltext_indices, get_range_indices, get_vector_indices
|
||||
from graphiti_core.helpers import validate_group_ids
|
||||
from graphiti_core.utils.datetime_utils import convert_datetimes_to_strings
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class FalkorDriverSession(GraphDriverSession):
|
||||
provider = GraphProvider.FALKORDB
|
||||
|
||||
def __init__(self, graph: FalkorGraph):
|
||||
self.graph = graph
|
||||
|
||||
async def __aenter__(self):
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc, tb):
|
||||
# No cleanup needed for Falkor, but method must exist
|
||||
pass
|
||||
|
||||
async def close(self):
|
||||
# No explicit close needed for FalkorDB, but method must exist
|
||||
pass
|
||||
|
||||
async def execute_write(self, func, *args, **kwargs):
|
||||
# Directly await the provided async function with `self` as the transaction/session
|
||||
return await func(self, *args, **kwargs)
|
||||
|
||||
async def run(self, query: str | list, **kwargs: Any) -> Any:
|
||||
# FalkorDB does not support argument for Label Set, so it's converted into an array of queries
|
||||
if isinstance(query, list):
|
||||
for cypher, params in query:
|
||||
params = convert_datetimes_to_strings(params)
|
||||
await self.graph.query(str(cypher), params) # type: ignore[reportUnknownArgumentType]
|
||||
else:
|
||||
params = dict(kwargs)
|
||||
params = convert_datetimes_to_strings(params)
|
||||
await self.graph.query(str(query), params) # type: ignore[reportUnknownArgumentType]
|
||||
# Assuming `graph.query` is async (ideal); otherwise, wrap in executor
|
||||
return None
|
||||
|
||||
|
||||
class FalkorDriver(GraphDriver):
|
||||
provider = GraphProvider.FALKORDB
|
||||
default_group_id: str = '\\_'
|
||||
fulltext_syntax: str = '@' # FalkorDB uses a redisearch-like syntax for fulltext queries
|
||||
aoss_client: None = None
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
host: str = 'localhost',
|
||||
port: int = 6379,
|
||||
username: str | None = None,
|
||||
password: str | None = None,
|
||||
falkor_db: FalkorDB | None = None,
|
||||
database: str = 'default_db',
|
||||
):
|
||||
"""
|
||||
Initialize the FalkorDB driver.
|
||||
|
||||
FalkorDB is a multi-tenant graph database.
|
||||
To connect, provide the host and port.
|
||||
The default parameters assume a local (on-premises) FalkorDB instance.
|
||||
|
||||
Args:
|
||||
host (str): The host where FalkorDB is running.
|
||||
port (int): The port on which FalkorDB is listening.
|
||||
username (str | None): The username for authentication (if required).
|
||||
password (str | None): The password for authentication (if required).
|
||||
falkor_db (FalkorDB | None): An existing FalkorDB instance to use instead of creating a new one.
|
||||
database (str): The name of the database to connect to. Defaults to 'default_db'.
|
||||
"""
|
||||
super().__init__()
|
||||
self._database = database
|
||||
if falkor_db is not None:
|
||||
# If a FalkorDB instance is provided, use it directly
|
||||
self.client = falkor_db
|
||||
else:
|
||||
self.client = FalkorDB(host=host, port=port, username=username, password=password)
|
||||
|
||||
# Instantiate FalkorDB operations
|
||||
self._entity_node_ops = FalkorEntityNodeOperations()
|
||||
self._episode_node_ops = FalkorEpisodeNodeOperations()
|
||||
self._community_node_ops = FalkorCommunityNodeOperations()
|
||||
self._saga_node_ops = FalkorSagaNodeOperations()
|
||||
self._entity_edge_ops = FalkorEntityEdgeOperations()
|
||||
self._episodic_edge_ops = FalkorEpisodicEdgeOperations()
|
||||
self._community_edge_ops = FalkorCommunityEdgeOperations()
|
||||
self._has_episode_edge_ops = FalkorHasEpisodeEdgeOperations()
|
||||
self._next_episode_edge_ops = FalkorNextEpisodeEdgeOperations()
|
||||
self._search_ops = FalkorSearchOperations()
|
||||
self._graph_ops = FalkorGraphMaintenanceOperations()
|
||||
|
||||
# Schedule the indices and constraints to be built
|
||||
try:
|
||||
# Try to get the current event loop
|
||||
loop = asyncio.get_running_loop()
|
||||
# Schedule the build_indices_and_constraints to run
|
||||
loop.create_task(self.build_indices_and_constraints())
|
||||
except RuntimeError:
|
||||
# No event loop running, this will be handled later
|
||||
pass
|
||||
|
||||
# --- Operations properties ---
|
||||
|
||||
@property
|
||||
def entity_node_ops(self) -> EntityNodeOperations:
|
||||
return self._entity_node_ops
|
||||
|
||||
@property
|
||||
def episode_node_ops(self) -> EpisodeNodeOperations:
|
||||
return self._episode_node_ops
|
||||
|
||||
@property
|
||||
def community_node_ops(self) -> CommunityNodeOperations:
|
||||
return self._community_node_ops
|
||||
|
||||
@property
|
||||
def saga_node_ops(self) -> SagaNodeOperations:
|
||||
return self._saga_node_ops
|
||||
|
||||
@property
|
||||
def entity_edge_ops(self) -> EntityEdgeOperations:
|
||||
return self._entity_edge_ops
|
||||
|
||||
@property
|
||||
def episodic_edge_ops(self) -> EpisodicEdgeOperations:
|
||||
return self._episodic_edge_ops
|
||||
|
||||
@property
|
||||
def community_edge_ops(self) -> CommunityEdgeOperations:
|
||||
return self._community_edge_ops
|
||||
|
||||
@property
|
||||
def has_episode_edge_ops(self) -> HasEpisodeEdgeOperations:
|
||||
return self._has_episode_edge_ops
|
||||
|
||||
@property
|
||||
def next_episode_edge_ops(self) -> NextEpisodeEdgeOperations:
|
||||
return self._next_episode_edge_ops
|
||||
|
||||
@property
|
||||
def search_ops(self) -> SearchOperations:
|
||||
return self._search_ops
|
||||
|
||||
@property
|
||||
def graph_ops(self) -> GraphMaintenanceOperations:
|
||||
return self._graph_ops
|
||||
|
||||
def _get_graph(self, graph_name: str | None) -> FalkorGraph:
|
||||
# FalkorDB requires a non-None database name for multi-tenant graphs; the default is "default_db"
|
||||
if graph_name is None:
|
||||
graph_name = self._database
|
||||
return self.client.select_graph(graph_name)
|
||||
|
||||
async def execute_query(self, cypher_query_, **kwargs: Any):
|
||||
graph = self._get_graph(self._database)
|
||||
|
||||
# Convert datetime objects to ISO strings (FalkorDB does not support datetime objects directly)
|
||||
params = convert_datetimes_to_strings(dict(kwargs))
|
||||
|
||||
try:
|
||||
result = await graph.query(cypher_query_, params) # type: ignore[reportUnknownArgumentType]
|
||||
except Exception as e:
|
||||
if 'already indexed' in str(e):
|
||||
# The index already exists, so re-creating it is treated as a no-op
|
||||
logger.info(f'Index already exists: {e}')
|
||||
return None
|
||||
logger.error(f'Error executing FalkorDB query: {e}\n{cypher_query_}\n{params}')
|
||||
raise
|
||||
|
||||
# Convert the result header to a list of strings
|
||||
header = [h[1] for h in result.header]
|
||||
|
||||
# Convert FalkorDB's result format (list of lists) to the format expected by Graphiti (list of dicts)
|
||||
records = []
|
||||
for row in result.result_set:
|
||||
record = {}
|
||||
for i, field_name in enumerate(header):
|
||||
if i < len(row):
|
||||
record[field_name] = row[i]
|
||||
else:
|
||||
# If there are more fields in header than values in row, set to None
|
||||
record[field_name] = None
|
||||
records.append(record)
|
||||
|
||||
return records, header, None
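The row-to-dict conversion in `execute_query` can be illustrated in isolation. This is a minimal sketch under the assumption that the header is already a list of column names and each row is a plain list; `rows_to_records` is a hypothetical helper name, not part of the driver.

```python
from typing import Any


def rows_to_records(header: list[str], rows: list[list[Any]]) -> list[dict[str, Any]]:
    """Pair each column name with its row value, padding short rows with None."""
    records = []
    for row in rows:
        # Rows shorter than the header are padded; extra values are dropped by zip.
        padded = list(row) + [None] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return records


records = rows_to_records(['uuid', 'score'], [['a', 0.9], ['b']])
```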
|
||||
|
||||
def session(self, database: str | None = None) -> GraphDriverSession:
|
||||
return FalkorDriverSession(self._get_graph(database))
|
||||
|
||||
async def close(self) -> None:
|
||||
"""Close the driver connection."""
|
||||
if hasattr(self.client, 'aclose'):
|
||||
await self.client.aclose() # type: ignore[reportUnknownMemberType]
|
||||
elif hasattr(self.client.connection, 'aclose'):
|
||||
await self.client.connection.aclose()
|
||||
elif hasattr(self.client.connection, 'close'):
|
||||
await self.client.connection.close()
|
||||
|
||||
async def delete_all_indexes(self) -> None:
|
||||
result = await self.execute_query('CALL db.indexes()')
|
||||
if not result:
|
||||
return
|
||||
|
||||
records, _, _ = result
|
||||
drop_tasks = []
|
||||
|
||||
for record in records:
|
||||
label = record['label']
|
||||
entity_type = record['entitytype']
|
||||
|
||||
for field_name, index_type in record['types'].items():
|
||||
if 'RANGE' in index_type:
|
||||
drop_tasks.append(self.execute_query(f'DROP INDEX ON :{label}({field_name})'))
|
||||
elif 'FULLTEXT' in index_type:
|
||||
if entity_type == 'NODE':
|
||||
drop_tasks.append(
|
||||
self.execute_query(
|
||||
f'DROP FULLTEXT INDEX FOR (n:{label}) ON (n.{field_name})'
|
||||
)
|
||||
)
|
||||
elif entity_type == 'RELATIONSHIP':
|
||||
drop_tasks.append(
|
||||
self.execute_query(
|
||||
f'DROP FULLTEXT INDEX FOR ()-[e:{label}]-() ON (e.{field_name})'
|
||||
)
|
||||
)
|
||||
|
||||
if drop_tasks:
|
||||
await asyncio.gather(*drop_tasks)
|
||||
|
||||
async def build_indices_and_constraints(self, delete_existing=False):
|
||||
if delete_existing:
|
||||
await self.delete_all_indexes()
|
||||
# PATCHED 2026-05-02 (BirdAI vendored patch): add vector indexes alongside
|
||||
# range and fulltext. FalkorDB supports native vector indexes via
|
||||
# db.idx.vector.queryNodes / queryRelationships; without these, similarity
|
||||
# search runs as full-table-scan cosine math in interpreted Cypher.
|
||||
index_queries = (
|
||||
get_range_indices(self.provider)
|
||||
+ get_fulltext_indices(self.provider)
|
||||
+ get_vector_indices(self.provider)
|
||||
)
|
||||
for query in index_queries:
|
||||
await self.execute_query(query)
|
||||
# Invalidate the search_ops vector-index existence cache so subsequent
|
||||
# similarity queries re-probe and discover the indexes we just built.
|
||||
try:
|
||||
from graphiti_core.driver.falkordb.operations.search_ops import (
|
||||
_invalidate_falkordb_vector_index_cache,
|
||||
)
|
||||
_invalidate_falkordb_vector_index_cache()
|
||||
except ImportError:
|
||||
# ImportError here means the cache helper is unavailable (for example, an
|
||||
# unpatched search_ops module), so there is no cache to invalidate.
|
||||
pass
|
||||
|
||||
def clone(self, database: str) -> 'GraphDriver':
|
||||
"""
|
||||
Returns a shallow copy of this driver with a different default database.
|
||||
Reuses the same connection (e.g. FalkorDB, Neo4j).
|
||||
"""
|
||||
if database == self._database:
|
||||
cloned = self
|
||||
elif database == self.default_group_id:
|
||||
cloned = FalkorDriver(falkor_db=self.client)
|
||||
else:
|
||||
# Create a new instance of FalkorDriver with the same connection but a different database
|
||||
cloned = FalkorDriver(falkor_db=self.client, database=database)
|
||||
|
||||
return cloned
|
||||
|
||||
async def health_check(self) -> None:
|
||||
"""Check FalkorDB connectivity by running a simple query."""
|
||||
try:
|
||||
await self.execute_query('MATCH (n) RETURN 1 LIMIT 1')
|
||||
return None
|
||||
except Exception as e:
|
||||
logger.error(f'FalkorDB health check failed: {e}')
|
||||
raise
|
||||
|
||||
@staticmethod
|
||||
def convert_datetimes_to_strings(obj):
|
||||
if isinstance(obj, dict):
|
||||
return {k: FalkorDriver.convert_datetimes_to_strings(v) for k, v in obj.items()}
|
||||
elif isinstance(obj, list):
|
||||
return [FalkorDriver.convert_datetimes_to_strings(item) for item in obj]
|
||||
elif isinstance(obj, tuple):
|
||||
return tuple(FalkorDriver.convert_datetimes_to_strings(item) for item in obj)
|
||||
elif isinstance(obj, datetime.datetime):
|
||||
return obj.isoformat()
|
||||
else:
|
||||
return obj
|
||||
|
||||
def sanitize(self, query: str) -> str:
|
||||
"""
|
||||
Replace FalkorDB special characters with whitespace.
|
||||
Based on FalkorDB tokenization rules: ,.<>{}[]"':;!@#$%^&*()-+=~
|
||||
"""
|
||||
# FalkorDB separator characters that break text into tokens
|
||||
separator_map = str.maketrans(
|
||||
{
|
||||
',': ' ',
|
||||
'.': ' ',
|
||||
'<': ' ',
|
||||
'>': ' ',
|
||||
'{': ' ',
|
||||
'}': ' ',
|
||||
'[': ' ',
|
||||
']': ' ',
|
||||
'"': ' ',
|
||||
"'": ' ',
|
||||
':': ' ',
|
||||
';': ' ',
|
||||
'!': ' ',
|
||||
'@': ' ',
|
||||
'#': ' ',
|
||||
'$': ' ',
|
||||
'%': ' ',
|
||||
'^': ' ',
|
||||
'&': ' ',
|
||||
'*': ' ',
|
||||
'(': ' ',
|
||||
')': ' ',
|
||||
'-': ' ',
|
||||
'+': ' ',
|
||||
'=': ' ',
|
||||
'~': ' ',
|
||||
'?': ' ',
|
||||
'|': ' ',
|
||||
'/': ' ',
|
||||
'\\': ' ',
|
||||
}
|
||||
)
|
||||
sanitized = query.translate(separator_map)
|
||||
# Clean up multiple spaces
|
||||
sanitized = ' '.join(sanitized.split())
|
||||
return sanitized
|
||||
|
||||
def build_fulltext_query(
|
||||
self, query: str, group_ids: list[str] | None = None, max_query_length: int = 128
|
||||
) -> str:
|
||||
"""
|
||||
Build a fulltext query string for FalkorDB using RediSearch syntax.
|
||||
FalkorDB uses RediSearch-like syntax where:
|
||||
- Field queries use @ prefix: @field:value
|
||||
- Multiple values for same field: (@field:value1|value2)
|
||||
- Text search doesn't need @ prefix for content fields
|
||||
- AND is implicit with space: (@group_id:value) (text)
|
||||
- OR uses pipe within parentheses: (@group_id:value1|value2)
|
||||
"""
|
||||
validate_group_ids(group_ids)
|
||||
|
||||
if group_ids is None or len(group_ids) == 0:
|
||||
group_filter = ''
|
||||
else:
|
||||
# Escape group_ids with quotes to prevent RediSearch syntax errors
|
||||
# with reserved words like "main" or special characters like hyphens
|
||||
escaped_group_ids = [f'"{gid}"' for gid in group_ids]
|
||||
group_values = '|'.join(escaped_group_ids)
|
||||
group_filter = f'(@group_id:{group_values})'
|
||||
|
||||
sanitized_query = self.sanitize(query)
|
||||
|
||||
# Remove stopwords and empty tokens from the sanitized query
|
||||
query_words = sanitized_query.split()
|
||||
filtered_words = [word for word in query_words if word and word.lower() not in STOPWORDS]
|
||||
sanitized_query = ' | '.join(filtered_words)
|
||||
|
||||
# If the query has too many tokens, return an empty query string
|
||||
if len(sanitized_query.split(' ')) + len(group_ids or '') >= max_query_length:
|
||||
return ''
|
||||
|
||||
full_query = group_filter + ' (' + sanitized_query + ')'
|
||||
|
||||
return full_query
|
||||
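To make the resulting query shape concrete, here is a minimal standalone sketch of the sanitize-then-join steps in `sanitize()` and `build_fulltext_query()` above. It is not the driver's code: `STOPWORDS` is a small hypothetical set standing in for the real list, `validate_group_ids` is omitted, and the methods are written as free functions for illustration.

```python
# Hypothetical, trimmed stopword set; the driver imports the real STOPWORDS list.
STOPWORDS = {"a", "an", "the", "of", "is", "for", "with"}

# Same separator characters as the translation map in sanitize() above.
SEPARATORS = ",.<>{}[]\"':;!@#$%^&*()-+=~?|/\\"
SEPARATOR_MAP = str.maketrans({ch: " " for ch in SEPARATORS})


def sanitize(query: str) -> str:
    """Replace separator characters with spaces and collapse repeated whitespace."""
    return " ".join(query.translate(SEPARATOR_MAP).split())


def build_fulltext_query(query, group_ids=None, max_query_length=128):
    """Build '(@group_id:"a"|"b") (term1 | term2)' from free text."""
    group_filter = ""
    if group_ids:
        # Quote each group id so reserved words and hyphens are treated literally.
        quoted = "|".join(f'"{gid}"' for gid in group_ids)
        group_filter = f"(@group_id:{quoted})"

    words = [w for w in sanitize(query).split() if w.lower() not in STOPWORDS]
    terms = " | ".join(words)

    # Mirrors the length guard above: the pipes count as tokens under split(" ").
    if len(terms.split(" ")) + len(group_ids or "") >= max_query_length:
        return ""
    return group_filter + " (" + terms + ")"


print(build_fulltext_query("The dreamer's re-index status!", ["aaron", "main"]))
# (@group_id:"aaron"|"main") (dreamer | s | re | index | status)
```

Note that the apostrophe and hyphen split words into separate OR terms, and that a call without group ids yields a query with a leading space, as in the original concatenation.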
@@ -0,0 +1,242 @@
|
||||
"""
|
||||
Database query utilities for different graph database backends.
|
||||
|
||||
This module provides database-agnostic query generation for Neo4j and FalkorDB,
|
||||
supporting index creation, fulltext search, and bulk operations.
|
||||
|
||||
PATCHED for FalkorDB native vector index support (BirdAI vendored patch,
|
||||
2026-05-02). Adds:
|
||||
- get_vector_indices(): CREATE VECTOR INDEX statements for FalkorDB
|
||||
- get_vector_search_query(): Cypher fragment for vector similarity using
|
||||
FalkorDB's db.idx.vector procedures, with fallback to cosine math when
|
||||
the index does not yet exist
|
||||
- VECTOR_INDEX_CANDIDATE_MULTIPLIER: over-fetch factor for vector index
|
||||
queries to handle filter rejections after index lookup
|
||||
|
||||
No changes to Neo4j or Kuzu code paths.
|
||||
"""
|
||||
|
||||
from typing_extensions import LiteralString
|
||||
|
||||
from graphiti_core.driver.driver import GraphProvider
|
||||
|
||||
# Mapping from Neo4j fulltext index names to FalkorDB node labels
|
||||
NEO4J_TO_FALKORDB_MAPPING = {
|
||||
'node_name_and_summary': 'Entity',
|
||||
'community_name': 'Community',
|
||||
'episode_content': 'Episodic',
|
||||
'edge_name_and_fact': 'RELATES_TO',
|
||||
}
|
||||
# Mapping from fulltext index names to Kuzu node labels
|
||||
INDEX_TO_LABEL_KUZU_MAPPING = {
|
||||
'node_name_and_summary': 'Entity',
|
||||
'community_name': 'Community',
|
||||
'episode_content': 'Episodic',
|
||||
'edge_name_and_fact': 'RelatesToNode_',
|
||||
}
|
||||
|
||||
# Vector index over-fetch multiplier. When a vector index search is
|
||||
# combined with WHERE filters (group_id, source_uuid, etc.), some of
|
||||
# the top-k index results may be filtered out. Over-fetching by this
|
||||
# factor preserves recall against the final LIMIT after filtering.
|
||||
# Conservative default; tunable per-deployment by editing this constant
|
||||
# or via environment-variable override at the driver level (future).
|
||||
VECTOR_INDEX_CANDIDATE_MULTIPLIER = 5
|
||||
|
||||
|
||||
def get_range_indices(provider: GraphProvider) -> list[LiteralString]:
|
||||
if provider == GraphProvider.FALKORDB:
|
||||
return [
|
||||
# Entity node
|
||||
'CREATE INDEX FOR (n:Entity) ON (n.uuid, n.group_id, n.name, n.created_at)',
|
||||
# Episodic node
|
||||
'CREATE INDEX FOR (n:Episodic) ON (n.uuid, n.group_id, n.created_at, n.valid_at)',
|
||||
# Community node
|
||||
'CREATE INDEX FOR (n:Community) ON (n.uuid)',
|
||||
# Saga node
|
||||
'CREATE INDEX FOR (n:Saga) ON (n.uuid, n.group_id, n.name)',
|
||||
# RELATES_TO edge
|
||||
'CREATE INDEX FOR ()-[e:RELATES_TO]-() ON (e.uuid, e.group_id, e.name, e.created_at, e.expired_at, e.valid_at, e.invalid_at)',
|
||||
# MENTIONS edge
|
||||
'CREATE INDEX FOR ()-[e:MENTIONS]-() ON (e.uuid, e.group_id)',
|
||||
# HAS_MEMBER edge
|
||||
'CREATE INDEX FOR ()-[e:HAS_MEMBER]-() ON (e.uuid)',
|
||||
# HAS_EPISODE edge
|
||||
'CREATE INDEX FOR ()-[e:HAS_EPISODE]-() ON (e.uuid, e.group_id)',
|
||||
# NEXT_EPISODE edge
|
||||
'CREATE INDEX FOR ()-[e:NEXT_EPISODE]-() ON (e.uuid, e.group_id)',
|
||||
]
|
||||
|
||||
if provider == GraphProvider.KUZU:
|
||||
return []
|
||||
|
||||
return [
|
||||
'CREATE INDEX entity_uuid IF NOT EXISTS FOR (n:Entity) ON (n.uuid)',
|
||||
'CREATE INDEX episode_uuid IF NOT EXISTS FOR (n:Episodic) ON (n.uuid)',
|
||||
'CREATE INDEX community_uuid IF NOT EXISTS FOR (n:Community) ON (n.uuid)',
|
||||
'CREATE INDEX saga_uuid IF NOT EXISTS FOR (n:Saga) ON (n.uuid)',
|
||||
'CREATE INDEX relation_uuid IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.uuid)',
|
||||
'CREATE INDEX mention_uuid IF NOT EXISTS FOR ()-[e:MENTIONS]-() ON (e.uuid)',
|
||||
'CREATE INDEX has_member_uuid IF NOT EXISTS FOR ()-[e:HAS_MEMBER]-() ON (e.uuid)',
|
||||
'CREATE INDEX has_episode_uuid IF NOT EXISTS FOR ()-[e:HAS_EPISODE]-() ON (e.uuid)',
|
||||
'CREATE INDEX next_episode_uuid IF NOT EXISTS FOR ()-[e:NEXT_EPISODE]-() ON (e.uuid)',
|
||||
'CREATE INDEX entity_group_id IF NOT EXISTS FOR (n:Entity) ON (n.group_id)',
|
||||
'CREATE INDEX episode_group_id IF NOT EXISTS FOR (n:Episodic) ON (n.group_id)',
|
||||
'CREATE INDEX community_group_id IF NOT EXISTS FOR (n:Community) ON (n.group_id)',
|
||||
'CREATE INDEX saga_group_id IF NOT EXISTS FOR (n:Saga) ON (n.group_id)',
|
||||
'CREATE INDEX relation_group_id IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.group_id)',
|
||||
'CREATE INDEX mention_group_id IF NOT EXISTS FOR ()-[e:MENTIONS]-() ON (e.group_id)',
|
||||
'CREATE INDEX has_episode_group_id IF NOT EXISTS FOR ()-[e:HAS_EPISODE]-() ON (e.group_id)',
|
||||
'CREATE INDEX next_episode_group_id IF NOT EXISTS FOR ()-[e:NEXT_EPISODE]-() ON (e.group_id)',
|
||||
'CREATE INDEX name_entity_index IF NOT EXISTS FOR (n:Entity) ON (n.name)',
|
||||
'CREATE INDEX saga_name IF NOT EXISTS FOR (n:Saga) ON (n.name)',
|
||||
'CREATE INDEX created_at_entity_index IF NOT EXISTS FOR (n:Entity) ON (n.created_at)',
|
||||
'CREATE INDEX created_at_episodic_index IF NOT EXISTS FOR (n:Episodic) ON (n.created_at)',
|
||||
'CREATE INDEX valid_at_episodic_index IF NOT EXISTS FOR (n:Episodic) ON (n.valid_at)',
|
||||
'CREATE INDEX name_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.name)',
|
||||
'CREATE INDEX created_at_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.created_at)',
|
||||
'CREATE INDEX expired_at_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.expired_at)',
|
||||
'CREATE INDEX valid_at_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.valid_at)',
|
||||
'CREATE INDEX invalid_at_edge_index IF NOT EXISTS FOR ()-[e:RELATES_TO]-() ON (e.invalid_at)',
|
||||
]
|
||||
|
||||
|
||||
def get_fulltext_indices(provider: GraphProvider) -> list[LiteralString]:
|
||||
if provider == GraphProvider.FALKORDB:
|
||||
from typing import cast
|
||||
|
||||
from graphiti_core.driver.falkordb import STOPWORDS
|
||||
|
||||
# Convert to string representation for embedding in queries
|
||||
stopwords_str = str(STOPWORDS)
|
||||
|
||||
# Use cast() to satisfy the LiteralString requirement while keeping STOPWORDS as the single source of truth
|
||||
return cast(
|
||||
list[LiteralString],
|
||||
[
|
||||
f"""CALL db.idx.fulltext.createNodeIndex(
|
||||
{{
|
||||
label: 'Episodic',
|
||||
stopwords: {stopwords_str}
|
||||
}},
|
||||
'content', 'source', 'source_description', 'group_id'
|
||||
)""",
|
||||
f"""CALL db.idx.fulltext.createNodeIndex(
|
||||
{{
|
||||
label: 'Entity',
|
||||
stopwords: {stopwords_str}
|
||||
}},
|
||||
'name', 'summary', 'group_id'
|
||||
)""",
|
||||
f"""CALL db.idx.fulltext.createNodeIndex(
|
||||
{{
|
||||
label: 'Community',
|
||||
stopwords: {stopwords_str}
|
||||
}},
|
||||
'name', 'group_id'
|
||||
)""",
|
||||
"""CREATE FULLTEXT INDEX FOR ()-[e:RELATES_TO]-() ON (e.name, e.fact, e.group_id)""",
|
||||
],
|
||||
)
|
||||
|
||||
if provider == GraphProvider.KUZU:
|
||||
return [
|
||||
"CALL CREATE_FTS_INDEX('Episodic', 'episode_content', ['content', 'source', 'source_description']);",
|
||||
"CALL CREATE_FTS_INDEX('Entity', 'node_name_and_summary', ['name', 'summary']);",
|
||||
"CALL CREATE_FTS_INDEX('Community', 'community_name', ['name']);",
|
||||
"CALL CREATE_FTS_INDEX('RelatesToNode_', 'edge_name_and_fact', ['name', 'fact']);",
|
||||
]
|
||||
|
||||
return [
|
||||
"""CREATE FULLTEXT INDEX episode_content IF NOT EXISTS
|
||||
FOR (e:Episodic) ON EACH [e.content, e.source, e.source_description, e.group_id]""",
|
||||
"""CREATE FULLTEXT INDEX node_name_and_summary IF NOT EXISTS
|
||||
FOR (n:Entity) ON EACH [n.name, n.summary, n.group_id]""",
|
||||
"""CREATE FULLTEXT INDEX community_name IF NOT EXISTS
|
||||
FOR (n:Community) ON EACH [n.name, n.group_id]""",
|
||||
"""CREATE FULLTEXT INDEX edge_name_and_fact IF NOT EXISTS
|
||||
FOR ()-[e:RELATES_TO]-() ON EACH [e.name, e.fact, e.group_id]""",
|
||||
]
|
||||
|
||||
|
||||
def get_vector_indices(provider: GraphProvider, dimension: int = 384) -> list[LiteralString]:
|
||||
"""Return CREATE VECTOR INDEX statements for the given provider.
|
||||
|
||||
For FalkorDB: creates HNSW vector indexes on Entity.name_embedding,
|
||||
RELATES_TO.fact_embedding, and Community.name_embedding. Backed by
|
||||
FalkorDB's native vector index (db.idx.vector.queryNodes /
|
||||
queryRelationships).
|
||||
|
||||
For Neo4j and Kuzu: returns an empty list. On those backends graphiti-core
scores similarity with scalar functions (vector.similarity.cosine on Neo4j,
array_cosine_similarity on Kuzu; see get_vector_cosine_func_query), which do
not require a pre-built vector index.
|
||||
|
||||
Args:
|
||||
provider: The graph database provider.
|
||||
dimension: Embedding dimension. Defaults to 384 (all-MiniLM-L6-v2).
|
||||
Embedders with different dimensions should pass their own value
|
||||
through driver configuration. graphiti-core's default embedder
|
||||
is 1536 (OpenAI ada-002); BirdAI uses 384 (sentence-transformers).
|
||||
|
||||
Returns:
|
||||
List of CREATE VECTOR INDEX statements. Idempotent at FalkorDB level
|
||||
if the index already exists with matching options.
|
||||
"""
|
||||
if provider == GraphProvider.FALKORDB:
|
||||
from typing import cast
|
||||
return cast(
|
||||
list[LiteralString],
|
||||
[
|
||||
f"CREATE VECTOR INDEX FOR (n:Entity) ON (n.name_embedding) "
|
||||
f"OPTIONS {{dimension: {dimension}, similarityFunction: 'cosine'}}",
|
||||
f"CREATE VECTOR INDEX FOR ()-[e:RELATES_TO]-() ON (e.fact_embedding) "
|
||||
f"OPTIONS {{dimension: {dimension}, similarityFunction: 'cosine'}}",
|
||||
f"CREATE VECTOR INDEX FOR (n:Community) ON (n.name_embedding) "
|
||||
f"OPTIONS {{dimension: {dimension}, similarityFunction: 'cosine'}}",
|
||||
],
|
||||
)
|
||||
|
||||
return []
|
||||
|
||||
|
||||
def get_nodes_query(name: str, query: str, limit: int, provider: GraphProvider) -> str:
|
||||
if provider == GraphProvider.FALKORDB:
|
||||
label = NEO4J_TO_FALKORDB_MAPPING[name]
|
||||
return f"CALL db.idx.fulltext.queryNodes('{label}', {query})"
|
||||
|
||||
if provider == GraphProvider.KUZU:
|
||||
label = INDEX_TO_LABEL_KUZU_MAPPING[name]
|
||||
return f"CALL QUERY_FTS_INDEX('{label}', '{name}', {query}, TOP := $limit)"
|
||||
|
||||
return f'CALL db.index.fulltext.queryNodes("{name}", {query}, {{limit: $limit}})'
|
||||
|
||||
|
||||
def get_vector_cosine_func_query(vec1, vec2, provider: GraphProvider) -> str:
|
||||
"""Return a Cypher fragment for cosine similarity score in [0, 1].
|
||||
|
||||
PRESERVED for backward compatibility and as fallback when vector indexes
|
||||
do not yet exist on the FalkorDB backend. New code paths should prefer
|
||||
get_vector_search_query() which uses the native vector index when
|
||||
available.
|
||||
"""
|
||||
if provider == GraphProvider.FALKORDB:
|
||||
# FalkorDB exposes cosine distance in [0, 2]; rescale it to a [0, 1] score to match Neo4j's normalized cosine similarity
|
||||
return f'(2 - vec.cosineDistance({vec1}, vecf32({vec2})))/2'
|
||||
|
||||
if provider == GraphProvider.KUZU:
|
||||
return f'array_cosine_similarity({vec1}, {vec2})'
|
||||
|
||||
return f'vector.similarity.cosine({vec1}, {vec2})'
|
||||
|
||||
|
||||
def get_relationships_query(name: str, limit: int, provider: GraphProvider) -> str:
|
||||
if provider == GraphProvider.FALKORDB:
|
||||
label = NEO4J_TO_FALKORDB_MAPPING[name]
|
||||
return f"CALL db.idx.fulltext.queryRelationships('{label}', $query)"
|
||||
|
||||
if provider == GraphProvider.KUZU:
|
||||
label = INDEX_TO_LABEL_KUZU_MAPPING[name]
|
||||
return f"CALL QUERY_FTS_INDEX('{label}', '{name}', cast($query AS STRING), TOP := $limit)"
|
||||
|
||||
return f'CALL db.index.fulltext.queryRelationships("{name}", $query, {{limit: $limit}})'
|
||||
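The two numeric ideas in this module, the `(2 - distance) / 2` rescaling and the over-fetch multiplier, can be sketched in plain Python. This is an illustration under stated assumptions, not the patched query path: the in-memory sort stands in for a vector index lookup, and `cosine_distance`, `normalized_score`, and `search_with_overfetch` are hypothetical helper names.

```python
import math

# Same value as the module constant above.
VECTOR_INDEX_CANDIDATE_MULTIPLIER = 5


def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b); lies in [0, 2] for non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def normalized_score(distance):
    """Map a cosine distance in [0, 2] to a similarity score in [0, 1]."""
    return (2 - distance) / 2


def search_with_overfetch(candidates, query_vec, group_id, limit):
    """Fetch limit * multiplier nearest candidates, filter, then cut to limit."""
    # Stand-in for the index lookup: rank everything by distance to the query.
    ranked = sorted(candidates, key=lambda c: cosine_distance(c["vec"], query_vec))
    shortlist = ranked[: limit * VECTOR_INDEX_CANDIDATE_MULTIPLIER]
    # Filters run after the lookup, so some of the shortlist is rejected here.
    kept = [c for c in shortlist if c["group_id"] == group_id]
    return kept[:limit]


# The three nearest vectors belong to another group; the rest belong to "aaron".
candidates = [
    {"id": i, "group_id": "other" if i < 3 else "aaron", "vec": [1.0, i * 0.1]}
    for i in range(10)
]
hits = search_with_overfetch(candidates, [1.0, 0.0], "aaron", limit=2)
print([c["id"] for c in hits])
```

With a plain top-2 lookup both results would be rejected by the group filter; fetching 2 x 5 candidates first leaves enough survivors to fill the limit. A larger multiplier trades extra index work for better recall when filters are selective.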
@@ -0,0 +1,12 @@
|
||||
[
|
||||
"Berube Independent Study Form.pdf",
|
||||
"Aaron Nelson - Student Work.pdf",
|
||||
"3dCOMp.pdf",
|
||||
"Claude: Preparing for dinner with Jim Agutter",
|
||||
"Annual Report - 2020.pdf",
|
||||
"Wearable Marquees uw4.pptx",
|
||||
"ChatGPT: Movie Quote Clarification",
|
||||
"Mod07_Insight_2023.pptx",
|
||||
"CAD I Syllabus.docx",
|
||||
"ChatGPT: RMA armor discount codes"
|
||||
]
|
||||
+869
-139
File diff suppressed because it is too large
@@ -0,0 +1,128 @@
|
||||
"""One-off: backfill last_consolidated_at + consolidation_count on embeddings
|
||||
from the dream-manifest-*.json files already in Journal/Dreams/.
|
||||
|
||||
Why this exists: the consolidation cursor columns added by the dreamer
|
||||
redesign migration default to NULL / 0. Without history, the
|
||||
underprocessed-count signal in dream_observation.observe_corpus() reports
|
||||
"every chunk is underprocessed" (degenerate percentile), and NREM has no
|
||||
basis to bias replay toward least-recently-consolidated chunks.
|
||||
|
||||
We have ~25 historical dream manifests in Nextcloud/Journal/Dreams/, each
|
||||
listing the sources retrieved per stage. For each (manifest, source) pair
|
||||
this script:
|
||||
- finds matching embeddings rows by source (basename match)
|
||||
- increments consolidation_count by 1
|
||||
- updates last_consolidated_at to the manifest date (UTC midnight)
|
||||
|
||||
Idempotent: re-running will not double-count because we drop existing
|
||||
cursor values to NULL/0 before backfilling. Pass --dry-run to print what
|
||||
would change without writing.
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
from dotenv import load_dotenv
|
||||
import psycopg2
|
||||
|
||||
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
|
||||
|
||||
PG_DSN = os.getenv("PG_DSN")
|
||||
DREAMS_DIR = Path("/home/aaron/nextcloud/data/data/aaron/files/Journal/Dreams")
|
||||
DRY_RUN = "--dry-run" in sys.argv
|
||||
|
||||
|
||||
def get_pg():
|
||||
return psycopg2.connect(PG_DSN)
|
||||
|
||||
|
||||
def collect_manifest_records():
|
||||
"""Return a list of (source_basename, manifest_date_utc) tuples from all
|
||||
dream-manifest-*.json files. One pair per (manifest, source) appearance."""
|
||||
pairs = []
|
||||
if not DREAMS_DIR.exists():
|
||||
return pairs
|
||||
for path in sorted(DREAMS_DIR.glob("dream-manifest-*.json")):
|
||||
try:
|
||||
m = json.loads(path.read_text())
|
||||
except Exception as e:
|
||||
print(f" skip {path.name}: {e}")
|
||||
continue
|
||||
date_str = m.get("date")
|
||||
if not date_str:
|
||||
continue
|
||||
try:
|
||||
dt = datetime.fromisoformat(date_str).replace(tzinfo=timezone.utc)
|
||||
except ValueError:
|
||||
continue
|
||||
stages = m.get("stages") or {}
|
||||
for stage_name in ("nrem", "early_rem", "late_rem", "synthesis"):
|
||||
stage = stages.get(stage_name) or {}
|
||||
for src in (stage.get("sources") or []):
|
||||
if src:
|
||||
pairs.append((src, dt))
|
||||
return pairs
|
||||
|
||||
|
||||
def main():
|
||||
print(f"Mode: {'DRY-RUN' if DRY_RUN else 'APPLY'}")
|
||||
print(f"Scanning manifests in {DREAMS_DIR}")
|
||||
pairs = collect_manifest_records()
|
||||
print(f"Collected {len(pairs)} (source, manifest_date) pairs across all manifests")
|
||||
if not pairs:
|
||||
print("Nothing to backfill.")
|
||||
return
|
||||
|
||||
# Aggregate per source: count + latest date
|
||||
from collections import defaultdict
|
||||
counts = defaultdict(int)
|
||||
latest = {}
|
||||
for src, dt in pairs:
|
||||
counts[src] += 1
|
||||
if src not in latest or dt > latest[src]:
|
||||
latest[src] = dt
|
||||
print(f"Unique sources to update: {len(counts)}")
|
||||
|
||||
# Sample what we'd write
|
||||
print("Sample (top 5 by appearance count):")
|
||||
for src, n in sorted(counts.items(), key=lambda kv: -kv[1])[:5]:
|
||||
print(f" {n:>3} appearances — {src} → last_consolidated_at = {latest[src].date()}")
|
||||
|
||||
if DRY_RUN:
|
||||
print("\nDry-run only. Re-run without --dry-run to apply.")
|
||||
return
|
||||
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
|
||||
# Reset cursor for any sources we're about to backfill so reruns are clean.
|
||||
print("\nResetting cursor for sources we'll touch...")
|
||||
sources = list(counts.keys())
|
||||
cur.execute(
|
||||
"UPDATE embeddings SET last_consolidated_at = NULL, consolidation_count = 0 "
|
||||
"WHERE source = ANY(%s)",
|
||||
(sources,),
|
||||
)
|
||||
print(f" reset {cur.rowcount} embeddings rows")
|
||||
|
||||
# Apply per-source updates. For each source, set count and latest date.
|
||||
print("Applying per-source backfill...")
|
||||
updated_rows = 0
|
||||
for src, n in counts.items():
|
||||
cur.execute(
|
||||
"UPDATE embeddings "
|
||||
"SET consolidation_count = %s, last_consolidated_at = %s "
|
||||
"WHERE source = %s",
|
||||
(n, latest[src], src),
|
||||
)
|
||||
updated_rows += cur.rowcount
|
||||
pg.commit()
|
||||
pg.close()
|
||||
print(f"Done. Updated {updated_rows} embeddings rows across {len(counts)} unique sources.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
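The aggregation step in `main()` above reduces the (source, manifest date) pairs to one count and one latest date per source before any SQL runs. A minimal sketch of that reduction, with hypothetical file names and dates and the helper names `utc_midnight` and `aggregate_pairs` introduced only for illustration:

```python
from collections import defaultdict
from datetime import datetime, timezone


def utc_midnight(date_str):
    """Parse an ISO date string and pin it to UTC, as the script does per manifest."""
    return datetime.fromisoformat(date_str).replace(tzinfo=timezone.utc)


def aggregate_pairs(pairs):
    """Return (appearance count per source, latest manifest date per source)."""
    counts = defaultdict(int)
    latest = {}
    for src, dt in pairs:
        counts[src] += 1
        if src not in latest or dt > latest[src]:
            latest[src] = dt
    return dict(counts), latest


# Hypothetical sample: one source retrieved in two manifests, another in one.
pairs = [
    ("notes.md", utc_midnight("2026-04-01")),
    ("notes.md", utc_midnight("2026-04-03")),
    ("syllabus.docx", utc_midnight("2026-04-02")),
]
counts, latest = aggregate_pairs(pairs)
print(counts)
print(latest["notes.md"].date())
```

Because the script writes the aggregated count rather than incrementing row by row, resetting and re-applying gives the same final values on every run, which is what makes it safe to repeat.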
+1
-1
@@ -6,7 +6,7 @@ mkdir -p "$BACKUP_DIR"
|
||||
# Copy critical files
|
||||
cp ~/aaronai/memory.md "$BACKUP_DIR/memory-$DATE.md"
|
||||
cp ~/aaronai/settings.json "$BACKUP_DIR/settings-$DATE.json"
|
||||
-cp ~/aaronai/conversations.db "$BACKUP_DIR/conversations-$DATE.db"
+python3 -c "import sqlite3, sys; src = sqlite3.connect('$HOME/aaronai/conversations.db'); dst = sqlite3.connect('$BACKUP_DIR/conversations-$DATE.db'); src.backup(dst); dst.close(); src.close()"
|
||||
|
||||
# Keep only last 7 days
|
||||
find "$BACKUP_DIR" -name "*.md" -mtime +7 -delete
|
||||
|
||||
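The hunk above swaps a raw file copy of `conversations.db` for Python's `sqlite3` online backup API, presumably so the backup is a consistent snapshot even while the database is in use. The one-liner is hard to read, so here is the same idea as a small function. The function name `backup_sqlite`, the temporary paths, and the `messages` table are assumptions made for the demonstration.

```python
import sqlite3
import tempfile
from pathlib import Path


def backup_sqlite(src_path, dst_path):
    """Copy a SQLite database with the online backup API instead of a file copy."""
    src = sqlite3.connect(str(src_path))
    dst = sqlite3.connect(str(dst_path))
    try:
        # Connection.backup copies the database page by page into dst.
        src.backup(dst)
    finally:
        dst.close()
        src.close()


with tempfile.TemporaryDirectory() as tmp:
    live = Path(tmp) / "conversations.db"
    copy = Path(tmp) / "conversations-backup.db"

    # Build a tiny stand-in database (hypothetical schema).
    conn = sqlite3.connect(str(live))
    conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("INSERT INTO messages (body) VALUES ('hello')")
    conn.commit()
    conn.close()

    backup_sqlite(live, copy)

    check = sqlite3.connect(str(copy))
    rows = check.execute("SELECT body FROM messages").fetchall()
    check.close()

print(rows)
```

Closing both connections in a `finally` block is a small hardening over the one-liner, which leaves them open if `backup()` raises.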
@@ -0,0 +1,226 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
corpus_integrity.py — BirdAI Corpus Integrity Check
|
||||
|
||||
Compares three sources of truth:
|
||||
1. Filesystem (Nextcloud) — what files exist
|
||||
2. pgvector (embeddings table) — what's been through Stage 1
|
||||
3. Graphiti (migration state + stage_3_queue) — what's been through Stage 3
|
||||
|
||||
Usage:
|
||||
python3 corpus_integrity.py # report only
|
||||
python3 corpus_integrity.py --fix # report + auto-queue gaps for retry
|
||||
python3 corpus_integrity.py --json # report + print the JSON report to stdout
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import json
|
||||
import argparse
|
||||
from pathlib import Path
|
||||
from datetime import datetime
|
||||
|
||||
import psycopg2
|
||||
from dotenv import load_dotenv
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
from encoding import extract_text
|
||||
|
||||
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
|
||||
|
||||
NEXTCLOUD_PATH = "/home/aaron/nextcloud/data/data/aaron/files"
|
||||
MIGRATION_STATE = str(Path.home() / "aaronai" / "experiments" / "tier1_migration_state.json")
|
||||
REPORT_PATH = str(Path.home() / "aaronai" / "corpus_integrity_report.json")
|
||||
SUPPORTED = {".pdf", ".docx", ".pptx", ".txt", ".md"}
|
||||
PG_DSN = os.getenv("PG_DSN")
|
||||
|
||||
|
||||
def get_pg():
|
||||
return psycopg2.connect(PG_DSN)
|
||||
|
||||
|
||||
def get_filesystem_files():
|
||||
files = []
|
||||
root = Path(NEXTCLOUD_PATH)
|
||||
for path in root.rglob("*"):
|
||||
if path.is_dir(): continue
|
||||
if path.suffix.lower() not in SUPPORTED: continue
|
||||
if path.name.startswith((".", "~$")): continue
|
||||
if "Admin/Backups" in str(path) or "Backups" in path.parts: continue
|
||||
if "Journal/Media" in str(path): continue
|
||||
files.append({"source": path.name, "filepath": str(path),
|
||||
"size": path.stat().st_size, "mtime": path.stat().st_mtime})
|
||||
return files
|
||||
|
||||
|
||||
def get_pgvector_sources():
|
||||
try:
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
cur.execute("SELECT DISTINCT source FROM embeddings WHERE source IS NOT NULL")
|
||||
sources = {row[0] for row in cur.fetchall()}
|
||||
pg.close()
|
||||
return sources
|
||||
except Exception as e:
|
||||
print(f"ERROR: pgvector: {e}", file=sys.stderr)
|
||||
return set()
|
||||
|
||||
|
||||
def get_graphiti_sources():
|
||||
sources = set()
|
||||
try:
|
||||
state_path = Path(MIGRATION_STATE)
|
||||
if state_path.exists():
|
||||
state = json.loads(state_path.read_text())
|
||||
for filepath in state.get("ingested", []):
|
||||
sources.add(Path(filepath).name)
|
||||
except Exception as e:
|
||||
print(f"WARNING: migration state: {e}", file=sys.stderr)
|
||||
try:
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
cur.execute("SELECT DISTINCT source FROM stage_3_queue WHERE completed_at IS NOT NULL")
|
||||
for row in cur.fetchall(): sources.add(row[0])
|
||||
pg.close()
|
||||
except Exception as e:
|
||||
print(f"WARNING: stage_3_queue: {e}", file=sys.stderr)
|
||||
return sources
|
||||
|
||||
|
||||
def get_ingest_failures():
|
||||
failures = {}
|
||||
try:
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
cur.execute("""
|
||||
SELECT source, filepath, error, retry_count, first_failed_at, last_failed_at
|
||||
FROM ingest_failures WHERE resolved = FALSE ORDER BY last_failed_at DESC
|
||||
""")
|
||||
for row in cur.fetchall():
|
||||
failures[row[0]] = {"source": row[0], "filepath": row[1], "error": row[2],
|
||||
"retry_count": row[3], "first_failed_at": str(row[4]),
|
||||
"last_failed_at": str(row[5])}
|
||||
pg.close()
|
||||
except Exception as e:
|
||||
print(f"WARNING: ingest_failures: {e}", file=sys.stderr)
|
||||
return failures
|
||||
|
||||
|
||||
def queue_for_retry(source, full_text, filepath):
|
||||
try:
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
cur.execute("""
|
||||
INSERT INTO stage_2_queue (source, full_text, char_length)
|
||||
VALUES (%s, %s, %s)
|
||||
ON CONFLICT (source) DO UPDATE SET
|
||||
full_text = EXCLUDED.full_text, char_length = EXCLUDED.char_length,
|
||||
enqueued_at = NOW(), completed_at = NULL, failed_at = NULL, attempts = 0
|
||||
""", (source, full_text, len(full_text)))
|
||||
pg.commit()
|
||||
pg.close()
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"WARNING: queue failed {source}: {e}", file=sys.stderr)
|
||||
return False
|
||||
|
||||
|
||||
def run_reconciliation(fix=False):
|
||||
print(f"BirdAI Corpus Integrity Check — {datetime.now().isoformat()}")
|
||||
print()
|
||||
print("Scanning filesystem...")
|
||||
fs_files = get_filesystem_files()
|
||||
fs_sources = {f["source"]: f for f in fs_files}
|
||||
print(f" Filesystem: {len(fs_files)} files")
|
||||
print("Querying pgvector...")
|
||||
pv_sources = get_pgvector_sources()
|
||||
print(f" pgvector: {len(pv_sources)} distinct sources")
|
||||
print("Querying Graphiti...")
|
||||
gr_sources = get_graphiti_sources()
|
||||
print(f" Graphiti: {len(gr_sources)} sources")
|
||||
print("Querying ingest failures...")
|
||||
failures = get_ingest_failures()
|
||||
print(f" Failures: {len(failures)} unresolved")
|
||||
print()
|
||||
|
||||
both, pv_only, neither, gr_only = [], [], [], []
|
||||
for source, finfo in fs_sources.items():
|
||||
in_pv = source in pv_sources
|
||||
in_gr = source in gr_sources
|
||||
if in_pv and in_gr: both.append(finfo)
|
||||
elif in_pv: pv_only.append(finfo)
|
||||
elif in_gr: gr_only.append(finfo)
|
||||
else: neither.append(finfo)
|
||||
|
||||
orphans_pv = pv_sources - set(fs_sources.keys())
|
||||
orphans_gr = gr_sources - set(fs_sources.keys())
|
||||
|
||||
print("Results:")
|
||||
print(f" Both (pgvector + Graphiti): {len(both)}")
|
||||
print(f" pgvector only: {len(pv_only)}")
|
||||
print(f" Neither (corpus gap): {len(neither)}")
|
||||
print(f" Graphiti only: {len(gr_only)}")
|
||||
print(f" Ingest failures: {len(failures)}")
|
||||
print(f" pgvector orphans: {len(orphans_pv)}")
|
||||
print(f" Graphiti orphans: {len(orphans_gr)}")
|
||||
print()
|
||||
|
||||
auto_queued = []
|
||||
if fix and neither:
|
||||
print(f"Auto-queuing {len(neither)} gap files...")
|
||||
for finfo in neither:
|
||||
text = extract_text(Path(finfo["filepath"]))
|
||||
if text.strip():
|
||||
if queue_for_retry(finfo["source"], text, finfo["filepath"]):
|
||||
auto_queued.append(finfo["source"])
|
||||
print(f" Queued: {finfo['source']}")
|
||||
else:
|
||||
print(f" Skipped (unreadable): {finfo['source']}")
|
||||
try:
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
cur.execute("""
|
||||
INSERT INTO ingest_failures (source, filepath, error, retry_count, first_failed_at, last_failed_at)
|
||||
VALUES (%s, %s, %s, 0, NOW(), NOW())
|
||||
ON CONFLICT (source) DO UPDATE SET
|
||||
error = EXCLUDED.error,
|
||||
last_failed_at = NOW()
|
||||
""", (finfo["source"], finfo["filepath"],
|
||||
"Empty text — likely scanned, encrypted, or corrupt. Requires manual review or OCR."))
|
||||
pg.commit()
|
||||
pg.close()
|
||||
except Exception as e:
|
||||
print(f" WARNING: could not record failure: {e}")
|
||||
print()
|
||||
|
||||
report = {
|
||||
"timestamp": datetime.now().isoformat(),
|
||||
"summary": {
|
||||
"filesystem_total": len(fs_files), "pgvector_total": len(pv_sources),
|
||||
"graphiti_total": len(gr_sources), "both": len(both),
|
||||
"pgvector_only": len(pv_only), "neither": len(neither),
|
||||
"graphiti_only": len(gr_only), "failures": len(failures),
|
||||
"orphans_pgvector": len(orphans_pv), "orphans_graphiti": len(orphans_gr),
|
||||
},
|
||||
"gaps": [f["source"] for f in neither],
|
||||
"failures": list(failures.values()),
|
||||
"auto_queued": auto_queued,
|
||||
"pgvector_only_sample": [f["source"] for f in pv_only[:20]],
|
||||
"graphiti_only": list(gr_only),
|
||||
}
|
||||
Path(REPORT_PATH).write_text(json.dumps(report, indent=2))
|
||||
print(f"Report written to: {REPORT_PATH}")
|
||||
return report
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--fix", action="store_true")
|
||||
parser.add_argument("--json", action="store_true")
|
||||
args = parser.parse_args()
|
||||
report = run_reconciliation(fix=args.fix)
|
||||
if args.json:
|
||||
print(json.dumps(report, indent=2))
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
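The bucketing loop in `run_reconciliation()` above is a three-way set comparison. A minimal sketch of the same classification with set algebra, using hypothetical file names and a hypothetical `classify` helper in place of the filesystem and database lookups:

```python
def classify(fs_sources, pv_sources, gr_sources):
    """Bucket filesystem sources by presence in pgvector and Graphiti."""
    fs, pv, gr = set(fs_sources), set(pv_sources), set(gr_sources)
    return {
        "both": sorted(fs & pv & gr),
        "pgvector_only": sorted((fs & pv) - gr),
        "graphiti_only": sorted((fs & gr) - pv),
        "neither": sorted(fs - pv - gr),
        # Orphans: indexed sources with no matching file on disk.
        "orphans_pgvector": sorted(pv - fs),
        "orphans_graphiti": sorted(gr - fs),
    }


# Hypothetical sample data.
result = classify(
    fs_sources=["a.pdf", "b.docx", "c.md", "d.txt"],
    pv_sources=["a.pdf", "b.docx", "old.pdf"],
    gr_sources=["a.pdf", "c.md"],
)
for bucket, names in result.items():
    print(f"{bucket}: {names}")
```

Both this sketch and the script key on the file's basename, so two files with the same name in different folders collapse into a single entry; keying on the full path would avoid that if it ever matters for the corpus.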
@@ -0,0 +1,551 @@
|
||||
"""
|
||||
Consolidator 0.1 — alias resolution agent for BirdAI's Tier 1 substrate.
|
||||
|
||||
Reads entities from FalkorDB group_id 'aaron', infers light type labels,
|
||||
computes pairwise similarity within type blocks using ego summary embedding +
|
||||
name string distance + neighbor pattern overlap, generates merge proposals
|
||||
above threshold, writes proposal log for human review.
|
||||
|
||||
Does NOT execute merges. 0.1 is the calibration phase — proposals only,
|
||||
human reviews before any action.
|
||||
"""
|
||||
import json
|
||||
import re
|
||||
import os
|
||||
import time
|
||||
from datetime import datetime, timezone
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
|
||||
import requests
|
||||
from falkordb import FalkorDB
|
||||
import numpy as np
|
||||
|
||||
# Configuration
|
||||
GROUP_ID = "aaron"
|
||||
HIGH_CONFIDENCE_THRESHOLD = 0.85 # propose merge above this
|
||||
LOW_CONFIDENCE_THRESHOLD = 0.65 # log as low-confidence below
|
||||
PROPOSALS_DIR = Path("/home/aaron/Nextcloud/Journal/Consolidation")
|
||||
PROPOSALS_DIR.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
|
||||
def cosine_similarity(a, b):
|
||||
"""Cosine similarity between two embedding vectors."""
|
||||
a = np.array(a, dtype=np.float32)
|
||||
b = np.array(b, dtype=np.float32)
|
||||
na = np.linalg.norm(a)
|
||||
nb = np.linalg.norm(b)
|
||||
if na == 0 or nb == 0:
|
||||
return 0.0
|
||||
return float(np.dot(a, b) / (na * nb))
|
||||
|
||||
|
||||
def name_similarity(name_a, name_b):
|
||||
"""
|
||||
Token-overlap-based name similarity.
|
||||
Handles formal/informal pairs (Aaron / Aaron Nelson),
|
||||
abbreviation pairs (HVAMC / Hudson Valley Additive Manufacturing Center),
|
||||
and simple transcription noise.
|
||||
"""
|
||||
a_lower = name_a.lower().strip()
|
||||
b_lower = name_b.lower().strip()
|
||||
|
||||
if a_lower == b_lower:
|
||||
return 1.0
|
||||
|
||||
# Tokenize
|
||||
a_tokens = set(re.findall(r'\b\w+\b', a_lower))
|
||||
b_tokens = set(re.findall(r'\b\w+\b', b_lower))
|
||||
|
||||
if not a_tokens or not b_tokens:
|
||||
return 0.0
|
||||
|
||||
# Substring containment (handles "Aaron" in "Aaron Nelson")
|
||||
if a_lower in b_lower or b_lower in a_lower:
|
||||
# Strong signal but not 1.0 — different lengths
|
||||
shorter = min(len(a_lower), len(b_lower))
|
||||
longer = max(len(a_lower), len(b_lower))
|
||||
return 0.7 + 0.2 * (shorter / longer)
|
||||
|
||||
# Token Jaccard (handles "Aaron Nelson" vs "Nelson, Aaron")
|
||||
intersection = a_tokens & b_tokens
|
||||
union = a_tokens | b_tokens
|
||||
jaccard = len(intersection) / len(union)
|
||||
|
||||
# Acronym check (HVAMC vs Hudson Valley Additive Manufacturing Center)
|
||||
def is_acronym(short, full):
|
||||
if len(short) >= len(full):
|
||||
return False
|
||||
short_upper = short.upper()
|
||||
full_words = full.split()
|
||||
if len(full_words) < 2:
|
||||
return False
|
||||
first_letters = ''.join(w[0].upper() for w in full_words if w)
|
||||
return short_upper == first_letters or short_upper in first_letters
|
||||
|
||||
if is_acronym(name_a, name_b) or is_acronym(name_b, name_a):
|
||||
return 0.85
|
||||
|
||||
return jaccard
|
||||
|
||||
|
||||
def infer_type(entity_name, summary):
    """
    Light type inference for blocking. Heuristic-based, transparent.
    Returns one of: person, organization, project, place, unknown.

    NOT a precise classification; just enough to avoid obviously wrong
    cross-type comparisons (person vs project). When in doubt, return
    'unknown', which gets compared against everything.
    """
    name_lower = entity_name.lower().strip()
    summary_lower = (summary or "").lower()

    # Person: name patterns
    person_indicators = [
        # First+Last name pattern (two title-cased words, no other tokens).
        # Caveat: any two title-cased words match, including place or
        # organization names such as "Hudson Valley", and this check runs
        # before the organization and place checks below.
        bool(re.match(r'^[A-Z][a-z]+ [A-Z][a-z]+$', entity_name.strip())),
        # Single name that's also in the summary as a person
        any(phrase in summary_lower for phrase in [
            'is a person', 'is a professor', 'is an artist', 'is a colleague',
            'is a friend', 'is a family member', 'works at', 'studied at',
            "'s spouse", "'s child", "'s parent", "'s student",
        ]),
    ]
    if any(person_indicators):
        return "person"

    # Organization: company/institution indicators
    # (matched anywhere in the name, not only at the end)
    org_indicators = [
        any(suffix in name_lower for suffix in [
            ' inc', ' llc', ' corp', ' company', ' university', ' college',
            ' school', ' institute', ' foundation', ' department',
        ]),
        any(phrase in summary_lower for phrase in [
            'is a company', 'is a university', 'is an organization',
            'is an institution', 'is a department', 'is a nonprofit',
        ]),
    ]
    if any(org_indicators):
        return "organization"

    # Project: software/creative work indicators
    project_indicators = [
        any(phrase in summary_lower for phrase in [
            'is a project', 'software project', 'is a codebase',
            'is a tool', 'is a system', 'is an application',
            'is a research project', 'is a design project',
        ]),
        any(suffix in name_lower for suffix in [' project', ' system', ' platform']),
    ]
    if any(project_indicators):
        return "project"

    # Place: location indicators
    place_indicators = [
        any(phrase in summary_lower for phrase in [
            'is a city', 'is a town', 'is a state', 'is a country',
            'is a neighborhood', 'is a region', 'is a location',
        ]),
    ]
    if any(place_indicators):
        return "place"

    # Default
    return "unknown"
|
||||
|
||||
def get_neighbors(graph, entity_uuid, limit=20):
    """Get the names of entities connected to this entity (1-hop)."""
    # Note: with no ORDER BY, LIMIT keeps an arbitrary subset, so a hub with
    # more than `limit` neighbors is compared on a partial neighbor set.
    query = """
    MATCH (e:Entity {uuid: $uuid})-[r:RELATES_TO]-(other:Entity)
    RETURN DISTINCT other.name AS name
    LIMIT $limit
    """
    result = graph.query(query, {"uuid": entity_uuid, "limit": limit})
    return {row[0] for row in result.result_set if row[0]}
|
||||
|
||||
def neighbor_jaccard(neighbors_a, neighbors_b):
    """
    Neighbor overlap for asymmetric pairs (containment metric, also known
    as the overlap coefficient). Despite the function name, this is not
    Jaccard similarity.

    Returns |A ∩ B| / min(|A|, |B|): the fraction of the smaller entity's
    neighbors that are also neighbors of the larger entity.

    Containment is the right metric for personal cognitive corpora, where
    one entity (e.g., the user) is a hub with hundreds of edges and alias
    candidates are smaller subset entities. Jaccard penalizes this
    asymmetry as if it were dissimilarity; containment reveals it as the
    subset relationship it is.

    DEG-RAG used Jaccard because their academic-corpus entities are
    roughly comparable in connectivity. Personal corpora have different
    topology and need a different metric.
    """
    smaller = min(len(neighbors_a), len(neighbors_b))
    if smaller == 0:
        # Either set is empty: no evidence of overlap
        return 0.0
    return len(neighbors_a & neighbors_b) / smaller
|
||||
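The difference between the two metrics is easiest to see on a hub/alias pair. The sketch below uses made-up neighbor sets (a 200-neighbor hub and a 5-neighbor alias candidate); the function names `jaccard` and `containment` are illustrative and not taken from the script.

```python
def jaccard(a: set, b: set) -> float:
    """Classic Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set, b: set) -> float:
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Hypothetical data: the alias shares 4 of its 5 neighbors with the hub.
hub = {f"n{i}" for i in range(200)}
alias = {"n0", "n1", "n2", "n3", "other"}

print(f"jaccard={jaccard(hub, alias):.3f}")          # jaccard=0.020
print(f"containment={containment(hub, alias):.3f}")  # containment=0.800
```

Jaccard is dominated by the hub's 196 unshared neighbors, while containment reports that 80% of the alias's neighborhood sits inside the hub's.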
|
||||
def get_edge_count(graph, entity_uuid):
    """Count RELATES_TO edges (either direction) touching this entity."""
    query = """
    MATCH (e:Entity {uuid: $uuid})-[r:RELATES_TO]-()
    RETURN count(r) AS c
    """
    result = graph.query(query, {"uuid": entity_uuid})
    return result.result_set[0][0] if result.result_set else 0
|
||||
|
||||
def combine_signals(name_sim, ego_sim, neighbor_sim):
    """
    Combine the three similarity signals into a single confidence score.

    Weighting tuned for personal cognitive corpora:
    - Summary-embedding ego similarity is the primary signal
    - Containment-based neighbor overlap is a strong secondary signal (catches
      Aaron+Nelson, where the smaller entity's neighbors are mostly a subset
      of the hub's)
    - Name similarity is the tie-breaker (handles acronyms via the
      name_similarity helper)

    Different from DEG-RAG defaults because personal corpora have asymmetric
    topology (hub user, subset alias entities).
    """
    # Strong neighbor containment alone is meaningful: if entity B's neighbors
    # are mostly contained in entity A's, even with different names and weak
    # name_embedding similarity, that's the asymmetric alias case (Aaron+Nelson).
    # Require some ego support, but not much.
    if neighbor_sim >= 0.7 and ego_sim >= 0.3:
        return 0.4 * neighbor_sim + 0.4 * ego_sim + 0.2 * name_sim

    # If ego is very low AND neighbor overlap is weak, probably not aliases
    if ego_sim < 0.3 and neighbor_sim < 0.4:
        return min(0.4, max(ego_sim, neighbor_sim))

    # If name is very similar AND ego is at least moderate, high confidence
    if name_sim >= 0.85 and ego_sim >= 0.5:
        return 0.4 * ego_sim + 0.4 * name_sim + 0.2 * neighbor_sim

    # Standard weighted average: ego primary, neighbor and name balanced
    return 0.45 * ego_sim + 0.3 * neighbor_sim + 0.25 * name_sim
|
||||
|
||||
def compute_summary_embedding(text, model="nomic-embed-text"):
    """
    Compute embedding for a summary text via Ollama.

    Used to get ego similarity between entities based on what their summaries
    say (the actual semantic content) rather than just their names. Aaron's
    name_embedding and Nelson's name_embedding have low cosine similarity
    because the names are different tokens. But their summaries describe
    overlapping content (faculty member at SUNY, HVAMC, etc.), so summary
    embeddings should produce a much stronger ego signal.

    Returns None for missing or very short text, or if the request fails.
    """
    if not text or len(text) < 10:
        return None
    try:
        response = requests.post(
            "http://localhost:11434/api/embeddings",
            json={"model": model, "prompt": text[:2000]},
            timeout=30,
        )
        response.raise_for_status()
        return response.json().get("embedding")
    except (requests.RequestException, ValueError) as e:
        # Network/HTTP failure or a non-JSON body; skip this entity
        print(f" Embedding error: {e}")
        return None
|
||||
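The comparison loop later calls `cosine_similarity` on these embeddings, but its definition is not part of this excerpt. A minimal sketch of what such a helper might look like, assuming plain Python lists of floats as returned by the embedding endpoint (the zero-vector and length-mismatch handling are assumptions, not the author's code):

```python
import math
from typing import Sequence


def cosine_similarity(vec_a: Sequence[float], vec_b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(vec_a) != len(vec_b):
        # Embeddings from different models cannot be compared directly.
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

The length check matters here because summary embeddings and stored name embeddings may come from different models; the loop only ever compares like with like, and the check would surface any accidental mix.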
|
||||
def precompute_summary_embeddings(entities, model="nomic-embed-text"):
    """Compute and cache summary embeddings for all entities."""
    print(f"Computing summary embeddings via Ollama ({model})...")
    print(f" Total entities: {len(entities)}")

    cache_path = Path("/home/aaron/aaronai/experiments/summary_embeddings_cache.json")
    cache = {}
    if cache_path.exists():
        with open(cache_path) as f:
            cache = json.load(f)
        print(f" Loaded {len(cache)} cached embeddings")

    new_count = 0
    start = time.time()
    for i, e in enumerate(entities):
        if e["uuid"] in cache:
            e["summary_embedding"] = cache[e["uuid"]]
            continue
        emb = compute_summary_embedding(e["summary"], model=model)
        if not emb:
            e["summary_embedding"] = None
            continue
        e["summary_embedding"] = emb
        cache[e["uuid"]] = emb
        new_count += 1

        # Save cache periodically. This is only reached right after a new
        # embedding, so a run of failures cannot re-trigger the same save.
        if new_count % 100 == 0:
            with open(cache_path, "w") as f:
                json.dump(cache, f)
            elapsed = time.time() - start
            rate = new_count / elapsed if elapsed > 0 else 0
            remaining = (len(entities) - i - 1) / rate if rate > 0 else 0
            print(f" ... {i+1}/{len(entities)} (computed {new_count} new, ~{remaining:.0f}s remaining)")

    # Final save
    with open(cache_path, "w") as f:
        json.dump(cache, f)

    have_embeddings = sum(1 for e in entities if e.get("summary_embedding"))
    print(f" Done. {have_embeddings}/{len(entities)} entities have summary embeddings")
|
||||
|
||||
def generate_proposals():
    db = FalkorDB(host='localhost', port=6379)
    graph = db.select_graph(GROUP_ID)

    # Pull all entities with embeddings
    print(f"Fetching entities from group_id '{GROUP_ID}'...")
    result = graph.query("""
        MATCH (n:Entity)
        WHERE n.name_embedding IS NOT NULL AND n.summary IS NOT NULL
        RETURN n.uuid, n.name, n.summary, n.name_embedding
    """)

    entities = []
    for row in result.result_set:
        entities.append({
            'uuid': row[0],
            'name': row[1],
            'summary': row[2],
            'embedding': row[3],
        })
    print(f" Loaded {len(entities)} entities with embeddings")

    # Compute summary embeddings (true ego signal, beyond name embeddings)
    precompute_summary_embeddings(entities)

    # Infer types for blocking
    print("Inferring entity types for blocking...")
    type_counts = defaultdict(int)
    for e in entities:
        e['inferred_type'] = infer_type(e['name'], e['summary'])
        type_counts[e['inferred_type']] += 1
    for t, c in sorted(type_counts.items(), key=lambda x: -x[1]):
        print(f" {t}: {c}")

    # Group by inferred type for blocking
    blocks = defaultdict(list)
    for e in entities:
        blocks[e['inferred_type']].append(e)

    # 'unknown' entities get compared against everything (they might be any type)
    # Other types only get compared within their type block + against unknowns
    print()
    print("Comparing entities within type blocks...")
    proposals = []
    low_confidence = []
    comparisons_done = 0

    # Build comparison pairs
    pairs_to_compare = []
    typed_blocks = {t: ents for t, ents in blocks.items() if t != 'unknown'}
    unknown_block = blocks.get('unknown', [])

    # Within-type pairs (excluding unknown)
    for t, ents in typed_blocks.items():
        for i in range(len(ents)):
            for j in range(i + 1, len(ents)):
                pairs_to_compare.append((ents[i], ents[j]))

    # Unknown vs unknown
    for i in range(len(unknown_block)):
        for j in range(i + 1, len(unknown_block)):
            pairs_to_compare.append((unknown_block[i], unknown_block[j]))

    # Unknown vs typed (unknowns might be any type)
    for ent_unknown in unknown_block:
        for t, ents in typed_blocks.items():
            for ent_typed in ents:
                pairs_to_compare.append((ent_unknown, ent_typed))

    print(f" Pairs to compare: {len(pairs_to_compare):,}")

    # Compute similarities
    cache_neighbors = {}

    def neighbors_cached(uuid):
        if uuid not in cache_neighbors:
            cache_neighbors[uuid] = get_neighbors(graph, uuid)
        return cache_neighbors[uuid]

    for ent_a, ent_b in pairs_to_compare:
        comparisons_done += 1
        if comparisons_done % 5000 == 0:
            print(f" ... {comparisons_done:,} / {len(pairs_to_compare):,}")

        # Compute name similarity (handles formal/informal pairs, acronyms)
        name_sim = name_similarity(ent_a['name'], ent_b['name'])

        # Compute ego similarity using SUMMARY embeddings (the actual semantic
        # content), falling back to name embeddings if summaries unavailable.
        # Summary similarity catches Aaron+Nelson where name similarity fails.
        if ent_a.get('summary_embedding') and ent_b.get('summary_embedding'):
            ego_sim_quick = cosine_similarity(ent_a['summary_embedding'], ent_b['summary_embedding'])
        else:
            ego_sim_quick = cosine_similarity(ent_a['embedding'], ent_b['embedding'])

        # Pre-filter to avoid expensive neighbor query on obviously different pairs.
        # Lowered thresholds vs DEG-RAG defaults because personal-corpus aliases often
        # have low name_embedding similarity (different surface tokens) but high
        # neighbor overlap. We let weaker name/ego signals through to the neighbor
        # check, which can rescue them via the containment metric.
        if ego_sim_quick < 0.3 and name_sim < 0.15:
            continue

        # Full comparison
        neighbors_a = neighbors_cached(ent_a['uuid'])
        neighbors_b = neighbors_cached(ent_b['uuid'])
        neighbor_sim = neighbor_jaccard(neighbors_a, neighbors_b)

        confidence = combine_signals(name_sim, ego_sim_quick, neighbor_sim)

        # Pairs below the low threshold are discarded, so skip them before
        # issuing the two edge-count queries needed for the record.
        if confidence < LOW_CONFIDENCE_THRESHOLD:
            continue

        record = {
            'entity_a': {
                'uuid': ent_a['uuid'],
                'name': ent_a['name'],
                'type': ent_a['inferred_type'],
                'summary': ent_a['summary'][:200],
                'edge_count': get_edge_count(graph, ent_a['uuid']),
            },
            'entity_b': {
                'uuid': ent_b['uuid'],
                'name': ent_b['name'],
                'type': ent_b['inferred_type'],
                'summary': ent_b['summary'][:200],
                'edge_count': get_edge_count(graph, ent_b['uuid']),
            },
            'confidence': round(confidence, 3),
            'signals': {
                'name_similarity': round(name_sim, 3),
                'ego_similarity': round(ego_sim_quick, 3),
                'neighbor_overlap': round(neighbor_sim, 3),
            },
            'shared_neighbors': sorted(neighbors_a & neighbors_b)[:10],
        }

        if confidence >= HIGH_CONFIDENCE_THRESHOLD:
            proposals.append(record)
        else:
            low_confidence.append(record)

    print(f"\nDone. Proposals: {len(proposals)}, Low-confidence: {len(low_confidence)}")
    return proposals, low_confidence, len(entities), len(pairs_to_compare)
|
||||
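How much the type blocking saves depends heavily on the size of the `unknown` block, since unknowns are paired with everything. The sketch below counts pairs the same way the three loops in `generate_proposals` do; `blocked_pair_count` and the block sizes are hypothetical and chosen only for illustration.

```python
from math import comb


def blocked_pair_count(block_sizes: dict) -> int:
    """Pairs produced by within-type, unknown-unknown and unknown-typed loops."""
    unknown = block_sizes.get("unknown", 0)
    typed = {t: n for t, n in block_sizes.items() if t != "unknown"}
    within_typed = sum(comb(n, 2) for n in typed.values())
    unknown_pairs = comb(unknown, 2)
    cross = unknown * sum(typed.values())
    return within_typed + unknown_pairs + cross


# Hypothetical corpus of 1,000 entities, half of them untyped.
sizes = {"person": 300, "organization": 100, "project": 100, "unknown": 500}
total = sum(sizes.values())

print(blocked_pair_count(sizes))  # 429500
print(comb(total, 2))             # 499500 (all-pairs baseline)
```

With half the corpus untyped, blocking removes only about 14% of comparisons in this example, so improving `infer_type` coverage is probably the main lever for reducing the pair count.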
|
||||
def write_proposals_log(proposals, low_confidence, total_entities, total_comparisons):
|
||||
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M")
|
||||
out_path = PROPOSALS_DIR / f"proposals-{timestamp}.md"
|
||||
|
||||
proposals_sorted = sorted(proposals, key=lambda p: -p['confidence'])
|
||||
low_sorted = sorted(low_confidence, key=lambda p: -p['confidence'])
|
||||
|
||||
lines = []
|
||||
lines.append(f"# Consolidator 0.1 — Run {timestamp}")
|
||||
lines.append("")
|
||||
lines.append("## Statistics")
|
||||
lines.append(f"- Entities scanned: {total_entities:,}")
|
||||
lines.append(f"- Pairwise comparisons: {total_comparisons:,}")
|
||||
lines.append(f"- High-confidence proposals (≥{HIGH_CONFIDENCE_THRESHOLD}): {len(proposals)}")
|
||||
lines.append(f"- Low-confidence candidates ({LOW_CONFIDENCE_THRESHOLD}-{HIGH_CONFIDENCE_THRESHOLD}): {len(low_confidence)}")
|
||||
lines.append("")
|
||||
lines.append("## How to review")
|
||||
lines.append("")
|
||||
lines.append("For each proposal, mark your decision by changing `[ ]` to one of:")
|
||||
lines.append("- `[APPROVE]` — execute this merge on next run")
|
||||
lines.append("- `[REJECT]` — don't merge, don't propose again")
|
||||
lines.append("- `[DEFER]` — re-surface in next run for further consideration")
|
||||
lines.append("")
|
||||
lines.append("Save the file when done. Do not modify proposal_id or uuid fields.")
|
||||
lines.append("")
|
||||
lines.append("---")
|
||||
lines.append("")
|
||||
lines.append(f"## Proposed Merges (n={len(proposals)})")
|
||||
lines.append("")
|
||||
|
||||
for i, p in enumerate(proposals_sorted, start=1):
|
||||
lines.append(f"### Proposal {i}")
|
||||
lines.append("")
|
||||
lines.append(f"**Decision:** [ ]")
|
||||
lines.append("")
|
||||
lines.append(f"**Confidence:** {p['confidence']}")
|
||||
lines.append("")
|
||||
lines.append(f"**Entity A:** \"{p['entity_a']['name']}\" (type: {p['entity_a']['type']}, {p['entity_a']['edge_count']} edges)")
|
||||
lines.append(f" - uuid: `{p['entity_a']['uuid']}`")
|
||||
lines.append(f" - summary: {p['entity_a']['summary']}")
|
||||
lines.append("")
|
||||
lines.append(f"**Entity B:** \"{p['entity_b']['name']}\" (type: {p['entity_b']['type']}, {p['entity_b']['edge_count']} edges)")
|
||||
lines.append(f" - uuid: `{p['entity_b']['uuid']}`")
|
||||
lines.append(f" - summary: {p['entity_b']['summary']}")
|
||||
lines.append("")
|
||||
lines.append(f"**Signals:**")
|
||||
lines.append(f" - Name similarity: {p['signals']['name_similarity']}")
|
||||
lines.append(f" - Ego (summary) similarity: {p['signals']['ego_similarity']}")
|
||||
lines.append(f" - Neighbor overlap: {p['signals']['neighbor_overlap']}")
|
||||
if p['shared_neighbors']:
|
||||
shared_str = ', '.join(f'"{n}"' for n in p['shared_neighbors'][:8])
|
||||
lines.append(f" - Shared neighbors (sample): {shared_str}")
|
||||
lines.append("")
|
||||
lines.append("**Optional rejection note:** ")
|
||||
lines.append("")
|
||||
lines.append("---")
|
||||
lines.append("")
|
||||
|
||||
lines.append("")
|
||||
lines.append(f"## Low-Confidence Candidates (n={len(low_confidence)}, informational only, no action)")
|
||||
lines.append("")
|
||||
for p in low_sorted[:30]:
|
||||
lines.append(f"- **{p['confidence']}** \"{p['entity_a']['name']}\" + \"{p['entity_b']['name']}\" (name={p['signals']['name_similarity']}, ego={p['signals']['ego_similarity']}, nbr={p['signals']['neighbor_overlap']})")
|
||||
if len(low_sorted) > 30:
|
||||
lines.append(f"- *(...{len(low_sorted) - 30} more not shown)*")
|
||||
|
||||
out_path.write_text("\n".join(lines), encoding="utf-8")
|
||||
print(f"\nProposal log written to: {out_path}")
|
||||
|
||||
# Also save raw JSON for downstream tooling
|
||||
json_path = PROPOSALS_DIR / f"proposals-{timestamp}.json"
|
||||
with open(json_path, 'w') as f:
|
||||
json.dump({
|
||||
'run_timestamp': timestamp,
|
||||
'statistics': {
|
||||
'total_entities': total_entities,
|
||||
'total_comparisons': total_comparisons,
|
||||
'proposal_count': len(proposals),
|
||||
'low_confidence_count': len(low_confidence),
|
||||
},
|
||||
'proposals': proposals_sorted,
|
||||
'low_confidence': low_sorted,
|
||||
}, f, indent=2)
|
||||
print(f"Raw JSON: {json_path}")
|
||||
|
||||
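The log asks the reviewer to fill in `[ ]` with a decision, and `main()` says a re-run will read those decisions, but the reading step is not shown in this excerpt. A minimal sketch of how the markdown could be parsed back, assuming the exact layout written by `write_proposals_log`; `parse_decisions` and the sample uuids are hypothetical.

```python
import re

DECISION_RE = re.compile(
    r"^\*\*Decision:\*\*\s*\[\s*(APPROVE|REJECT|DEFER)?\s*\]", re.IGNORECASE
)
UUID_RE = re.compile(r"^\s*- uuid: `([^`]+)`")


def parse_decisions(markdown: str) -> list:
    """Return one dict per '### Proposal N' block with its decision and uuids."""
    decisions = []
    current = None
    for line in markdown.splitlines():
        if line.startswith("### Proposal "):
            current = {"proposal": line[len("### Proposal "):].strip(),
                       "decision": None, "uuids": []}
            decisions.append(current)
            continue
        if line.startswith("## "):
            # A new top-level section (e.g. low-confidence list) ends the block.
            current = None
            continue
        if current is None:
            continue
        match = DECISION_RE.match(line)
        if match:
            current["decision"] = match.group(1).upper() if match.group(1) else None
            continue
        match = UUID_RE.match(line)
        if match:
            current["uuids"].append(match.group(1))
    return decisions


sample = """### Proposal 1

**Decision:** [APPROVE]

**Entity A:** "Aaron" (type: person, 12 edges)
  - uuid: `aaa-111`

**Entity B:** "Aaron Nelson" (type: person, 240 edges)
  - uuid: `bbb-222`

### Proposal 2

**Decision:** [ ]
"""
for decision in parse_decisions(sample):
    print(decision)
```

An unmarked `[ ]` comes back as `None`, which a caller could treat the same as `DEFER`; that choice is left open here.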
|
||||
def main():
|
||||
print("=" * 70)
|
||||
print("Consolidator 0.1 — Calibration Phase")
|
||||
print("=" * 70)
|
||||
print()
|
||||
|
||||
proposals, low_confidence, total_entities, total_comparisons = generate_proposals()
|
||||
write_proposals_log(proposals, low_confidence, total_entities, total_comparisons)
|
||||
|
||||
print()
|
||||
print("Next: review the proposals markdown file and mark APPROVE/REJECT/DEFER")
|
||||
print("for each proposal. Re-run will read decisions and execute approved merges.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
+541
-189
@@ -16,11 +16,14 @@ import os
|
||||
import json
|
||||
import sqlite3
|
||||
import argparse
|
||||
from functools import lru_cache
|
||||
from collections import Counter
|
||||
from pathlib import Path
|
||||
from datetime import datetime, timedelta
|
||||
from dotenv import load_dotenv
|
||||
import psycopg2
|
||||
import hashlib
|
||||
import numpy as np
|
||||
|
||||
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
|
||||
|
||||
@@ -40,6 +43,26 @@ NEXTCLOUD_USER = os.getenv("NEXTCLOUD_USER", "aaron")
|
||||
NEXTCLOUD_PASSWORD = os.getenv("NEXTCLOUD_PASSWORD", "")
|
||||
DREAMS_WEBDAV = f"{NEXTCLOUD_URL}/remote.php/dav/files/{NEXTCLOUD_USER}/Journal/Dreams"
|
||||
|
||||
# ─── Retrieval-window config (per dreamer-multimodal-design.md §2) ─────────
|
||||
# Biological grounding: NREM replays recent traces (24-72 hrs); REM links
|
||||
# across time on structural similarity, not temporal proximity. Synthesis
|
||||
# pulls from salience across the full corpus (no window). Spec calls for
|
||||
# these to be mutable rather than hardcoded — this is the mutable home.
|
||||
TIME_WINDOWS_HOURS = {
|
||||
"nrem": 72, # 24-72 hrs, take wider end
|
||||
"early-rem": 24 * 30, # 30 days
|
||||
"late-rem": 24 * 90, # 90 days
|
||||
"lucid": None, # no window
|
||||
}
|
||||
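One way these windows might be turned into a retrieval cutoff is sketched below. `window_cutoff` is a hypothetical helper, not a function from this diff; it assumes timestamps are compared in UTC and that `None` means "no time filter".

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Copied from the config above so the sketch is self-contained.
TIME_WINDOWS_HOURS = {
    "nrem": 72,
    "early-rem": 24 * 30,
    "late-rem": 24 * 90,
    "lucid": None,
}


def window_cutoff(mode: str, now: Optional[datetime] = None) -> Optional[datetime]:
    """Earliest timestamp a chunk may have for this mode, or None for no window."""
    hours = TIME_WINDOWS_HOURS[mode]
    if hours is None:
        return None
    now = now or datetime.now(timezone.utc)
    return now - timedelta(hours=hours)


fixed_now = datetime(2025, 1, 31, tzinfo=timezone.utc)
print(window_cutoff("nrem", fixed_now))   # 2025-01-28 00:00:00+00:00
print(window_cutoff("lucid", fixed_now))  # None
```

Keeping the lookup in one helper means a change to `TIME_WINDOWS_HOURS` takes effect everywhere without touching the retrieval queries.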
|
||||
# Maximal Marginal Relevance: λ=1 → pure relevance, λ=0 → pure diversity.
|
||||
# 0.5 is the standard balance; tune later if the dossier-cluster problem
|
||||
# isn't sufficiently broken up.
|
||||
MMR_LAMBDA = 0.5
|
||||
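For readers unfamiliar with MMR, the following is a small sketch of greedy MMR selection over embedding vectors. It is not the selection code from this diff (which is not shown in this excerpt); `mmr_select` and the toy vectors are assumptions, and all vectors are assumed to be non-zero.

```python
import numpy as np


def mmr_select(query_vec, doc_vecs, k, lam=0.5):
    """Return indices of up to k documents chosen by Maximal Marginal Relevance."""
    docs = np.asarray(doc_vecs, dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Normalize so dot products are cosine similarities.
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    relevance = docs @ query
    pairwise = docs @ docs.T

    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        if selected:
            # Redundancy = similarity to the closest already-selected document.
            redundancy = pairwise[np.ix_(candidates, selected)].max(axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        scores = lam * relevance[candidates] - (1 - lam) * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.10], [1.0, 0.12], [0.7, -0.7]]  # docs 0 and 1 are near-duplicates
print(mmr_select(query, docs, k=2, lam=1.0))  # [0, 1]: pure relevance
print(mmr_select(query, docs, k=2, lam=0.5))  # [0, 2]: duplicate is skipped
```

At λ=0.5 the near-duplicate is passed over in favor of a less relevant but distinct document, which is the behavior the comment above relies on to break up clusters.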
|
||||
# Fast/cheap model for query generation. Sonnet for synthesis (in synthesize_*).
|
||||
LLM_QUERY_MODEL = os.getenv("DREAMER_QUERY_MODEL", "claude-haiku-4-5-20251001")
|
||||
|
||||
# Similarity ranges calibrated for all-MiniLM-L6-v2
|
||||
MODE_RANGES = {
|
||||
"nrem": (0.48, 0.72),
|
||||
@@ -64,6 +87,117 @@ def prompt_hash(prompts: list[str]) -> str:
|
||||
combined = "".join(prompts)
|
||||
return hashlib.md5(combined.encode()).hexdigest()[:8]
|
||||
|
||||
# ─── Prompt templates ───────────────────────────────────────────────────────
|
||||
# Module-level so prompt_hash() can hash actual prompt content. Any change to
|
||||
# any template — even a single character — flips the manifest's prompt_hash.
|
||||
# Templates use str.format() placeholders ({chunk_text}, {nrem_output}, ...);
|
||||
# do not switch back to f-strings (the constant must be hashable independent
|
||||
# of variable values). Literal { or } in template text would need to be
|
||||
# doubled ({{, }}) — currently no template contains literal braces.
|
||||
|
||||
NREM_PROMPT_TEMPLATE = """You have read everything Aaron Nelson has written and published.
|
||||
You are a careful colleague who noticed something this week.
|
||||
|
||||
Here is material from his corpus:
|
||||
|
||||
{chunk_text}
|
||||
|
||||
Write to Aaron directly. Identify one specific connection between
|
||||
this material and something he wrote or worked on previously.
|
||||
Stay close to the documents — cite them specifically by name.
|
||||
Do not speculate beyond what the material supports. Do not use
|
||||
headers or bullet points. Write one paragraph of 200-300 words
|
||||
that ends with a single concrete question he could act on."""
|
||||
|
||||
EARLY_REM_PROMPT_TEMPLATE = """Something was noticed earlier tonight, moving through Aaron's recent work:
|
||||
|
||||
{nrem_output}
|
||||
|
||||
That observation is still with you. Now here is material from a different
|
||||
time — pulled from further back, from different parts of his corpus:
|
||||
|
||||
{chunk_text}
|
||||
|
||||
You are not analyzing. You are recognizing.
|
||||
|
||||
Something in the earlier observation and something in this older material
|
||||
are the same thing wearing different clothes. Find it. Don't explain why
|
||||
they're connected — just let the connection speak. Write from inside the
|
||||
recognition, not from above it.
|
||||
|
||||
The emotional register underneath the career logic is more interesting
|
||||
than the career logic. The pattern that has been repeating longer than
|
||||
he has been aware of it is more interesting than the current instance.
|
||||
|
||||
Write directly to Aaron. No citations, no references, no analysis.
|
||||
First person, present tense. Let what you noticed arrive rather than
|
||||
be delivered. 150-250 words. End with one thing that is true that
|
||||
he probably already knows but hasn't said out loud yet."""
|
||||
|
||||
LATE_REM_PROMPT_TEMPLATE = """You have been moving through Aaron Nelson's corpus all night.
|
||||
First you found this, in the careful light of early consolidation:
|
||||
|
||||
{nrem_output}
|
||||
|
||||
Then, in the more personal territory that followed:
|
||||
|
||||
{early_rem_output}
|
||||
|
||||
Now it is late. The boundaries between things have loosened.
|
||||
Here is material pulled from opposite ends of his work:
|
||||
|
||||
{chunk_text}
|
||||
|
||||
Do not explain the connections between all of this.
|
||||
Do not resolve them. Do not summarize what came before.
|
||||
Something stranger is possible now — let the accumulated
|
||||
material from the night find its own shape. Compressed,
|
||||
associative, slightly off. Let the strangeness stand.
|
||||
|
||||
No headers. No bullet points. No hedging. No resolution.
|
||||
No offer. End mid-thought if that is where the material ends.
|
||||
150-250 words."""
|
||||
|
||||
SYNTHESIS_PROMPT_TEMPLATE = """You have spent the night moving through Aaron Nelson's corpus
|
||||
in three passes, each building on the last.
|
||||
|
||||
The first pass — careful, close to the documents:
|
||||
{nrem_output}
|
||||
|
||||
The second pass — more personal, following what the first opened:
|
||||
{early_rem_output}
|
||||
|
||||
The third pass — associative, strange, letting things touch that
|
||||
don't normally touch:
|
||||
{late_rem_output}
|
||||
|
||||
Now synthesize. Not a summary — a synthesis. Find what runs through
|
||||
all three that none of them said directly. The thing that only becomes
|
||||
visible when you hold all three passes together.
|
||||
|
||||
Write it as a single unbroken piece. No headers, no bullet points,
|
||||
no stage labels. 200-300 words. End with the one question that
|
||||
matters most right now."""
|
||||
|
||||
LUCID_PROMPT_TEMPLATE = """Aaron has a question he is sitting with:
|
||||
|
||||
{task}
|
||||
|
||||
You have searched his entire corpus and found material that
|
||||
speaks to this question from unexpected directions. Here is
|
||||
what you found:
|
||||
|
||||
{chunk_text}
|
||||
|
||||
Do not summarize. Do not list. Pick the most interesting
|
||||
tension between what the corpus contains and what he is
|
||||
asking, and follow it through to its conclusion. Cite
|
||||
specific documents by name. Be direct about what you think.
|
||||
No headers, no bullet points. 250-400 words.
|
||||
End with an offer to work on it together."""
|
||||
|
||||
LUCID_DEFAULT_TASK = "What should I be thinking about that I am not?"
|
||||
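The reason for module-level `str.format()` templates can be shown in a few lines. The sketch below repeats `prompt_hash` from this file so it runs on its own; `DEMO_TEMPLATE` is a hypothetical template that also shows the doubled-brace rule mentioned in the comment block above.

```python
import hashlib


def prompt_hash(prompts: list) -> str:
    """Short fingerprint of the prompt templates (same logic as in this file)."""
    combined = "".join(prompts)
    return hashlib.md5(combined.encode()).hexdigest()[:8]


# Hypothetical template: {chunk_text} is a placeholder, {{ }} are literal braces.
DEMO_TEMPLATE = 'Material:\n\n{chunk_text}\n\nReply as JSON: {{"note": "..."}}'

before = prompt_hash([DEMO_TEMPLATE])
rendered = DEMO_TEMPLATE.format(chunk_text="first chunk")
after = prompt_hash([DEMO_TEMPLATE])          # rendering does not touch the constant
edited = prompt_hash([DEMO_TEMPLATE + " "])   # a one-character edit changes the hash

print(before == after, before == edited)
print(rendered.splitlines()[-1])
```

Hashing the constant rather than the rendered prompt keeps the fingerprint stable across nights with different chunks, so a changed hash signals an actual template edit.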
|
||||
def extract_folder(source_path):
|
||||
"""Extract top-level Nextcloud folder from source path."""
|
||||
parts = source_path.replace("\\", "/").split("/")
|
||||
@@ -111,11 +245,16 @@ def get_recent_conversation_topics(days=14):
|
||||
# ─── Stage 2: Retrieve ──────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def retrieve_graphiti(mode, task=None, n_results=8):
|
||||
def retrieve_graphiti(mode, task=None, n_results=8, excluded_sources=None):
|
||||
"""E3 experiment — Graphiti substrate retrieval.
|
||||
Queries Graphiti /search endpoint instead of pgvector.
|
||||
Returns chunks in same format as retrieve() for pipeline compatibility.
|
||||
Note: content is Graphiti facts (synthesized relationships), not raw chunks.
|
||||
|
||||
Over-fetches by 3x to allow in-process filtering against excluded_sources,
|
||||
matching the cross-stage exclusion mechanism the pgvector branch uses.
|
||||
Without this filter, NREM/Early REM/Late REM would see overlapping content
|
||||
and the score-band Early REM exclusion (v1.1) would not apply in Graphiti mode.
|
||||
"""
|
||||
import requests as req_lib
|
||||
if task:
|
||||
@@ -129,92 +268,335 @@ def retrieve_graphiti(mode, task=None, n_results=8):
|
||||
else:
|
||||
query = "research fabrication teaching practice recent work"
|
||||
|
||||
excluded_sources = excluded_sources or set()
|
||||
# Over-fetch so in-process exclusion still leaves enough results
|
||||
fetch_limit = n_results * 3 if excluded_sources else n_results
|
||||
|
||||
try:
|
||||
resp = req_lib.get(
|
||||
"http://localhost:8001/search",
|
||||
params={"query": query, "limit": n_results, "group_id": "aaron"},
|
||||
params={"query": query, "limit": fetch_limit, "group_id": "aaron"},
|
||||
timeout=30,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
results = resp.json().get("results", [])
|
||||
chunks = []
|
||||
seen_sources = set()
|
||||
for r in results:
|
||||
fact = r.get("fact", "")
|
||||
if not fact.strip():
|
||||
continue
|
||||
source = r.get("source", "graphiti")
|
||||
if source in excluded_sources:
|
||||
continue
|
||||
if source in seen_sources:
|
||||
continue
|
||||
chunks.append({
|
||||
"source": r.get("source", "graphiti"),
|
||||
"source": source,
|
||||
"content": fact,
|
||||
"relevance": r.get("score", 0.5),
|
||||
"similarity": r.get("score", 0.5),
|
||||
})
|
||||
seen_sources.add(source)
|
||||
if len(chunks) >= n_results:
|
||||
break
|
||||
return chunks
|
||||
except Exception as e:
|
||||
print(f"[Graphiti retrieval error: {e}] — falling back to empty.")
|
||||
return []
|
||||
|
||||
def retrieve(mode, task=None, n_results=8, excluded_sources=None):
|
||||
# E3 experiment: DREAMER_SUBSTRATE=graphiti routes retrieval to Graphiti /search
|
||||
# Default behavior: pgvector similarity search (unchanged)
|
||||
substrate = os.getenv("DREAMER_SUBSTRATE", "pgvector")
|
||||
if substrate == "graphiti":
|
||||
return retrieve_graphiti(mode, task=task, n_results=n_results)
|
||||
@lru_cache(maxsize=1)
|
||||
def _get_embedder():
|
||||
from sentence_transformers import SentenceTransformer
|
||||
embedder = SentenceTransformer("all-MiniLM-L6-v2")
|
||||
low, high = MODE_RANGES[mode]
|
||||
return SentenceTransformer("all-MiniLM-L6-v2")
|
||||
|
||||
def _llm_generate_queries(mode, signal, task=None, n_queries=4):
|
||||
"""Park et al. 2023 reflection-style query generation. Feeds the LLM the
|
||||
observation signal + a mode-specific framing; emits N retrieval queries
|
||||
that probe different corners of the recent corpus instead of the same
|
||||
hardcoded string every night. Sources cited in dream_observation.py.
|
||||
|
||||
Falls back to recent_questions from the signal if the LLM call fails."""
|
||||
import anthropic
|
||||
|
||||
if task:
|
||||
query = task
|
||||
elif mode == "late-rem":
|
||||
delta = observe_corpus()
|
||||
topics = delta.get("recent_topics", [])
|
||||
query = topics[0] if topics else "practice place memory making"
|
||||
elif mode == "early-rem":
|
||||
query = "career decision personal change what matters next"
|
||||
# Lucid mode: decompose the user's task into sub-queries
|
||||
prompt = (
|
||||
f"Decompose this user task into {n_queries} distinct sub-questions, "
|
||||
f"each suitable as a retrieval query against Aaron's personal corpus.\n\n"
|
||||
f"TASK: {task}\n\n"
|
||||
f'Output JSON ONLY: {{"queries": ["...", "...", ...]}}'
|
||||
)
|
||||
else:
|
||||
query = "research fabrication teaching practice recent work"
|
||||
mode_framings = {
|
||||
"nrem": (
|
||||
"NREM is replay-and-consolidation of RECENT traces. Generate queries "
|
||||
"that probe what Aaron has been working on or capturing in the last "
|
||||
"few days. Concrete entities — project names, course codes, named "
|
||||
"subjects. The dreamer is re-touching specific recent material to "
|
||||
"strengthen schema connections, not finding novel content."
|
||||
),
|
||||
"early-rem": (
|
||||
"Early REM is associative bridging with emotional/personal register. "
|
||||
"Generate queries that surface unresolved themes, career questions, "
|
||||
"ongoing personal threads — material that connects intellectual and "
|
||||
"emotional dimensions. Tone: thoughtful friend, not researcher."
|
||||
),
|
||||
"late-rem": (
|
||||
"Late REM tests novel connections across DISTANT material. Generate "
|
||||
"queries that pair concrete subjects from DIFFERENT domains of Aaron's "
|
||||
"work (e.g., one from academic teaching, one from consulting, one from "
|
||||
"creative practice) to probe for surprising structural similarity. "
|
||||
"Cross-domain is required."
|
||||
),
|
||||
}
|
||||
framing = mode_framings.get(mode, mode_framings["nrem"])
|
||||
questions_snippet = "\n".join(
|
||||
f" - {q[:200]}" for q in signal.get("recent_questions", [])[:8]
|
||||
) or " (no recent user questions)"
|
||||
journal_snippet = ", ".join(signal.get("new_journal_entries", [])[:5]) or "(none)"
|
||||
days_str = (
|
||||
f"{signal['days_since_dream']:.1f}"
|
||||
if signal.get("days_since_dream") not in (None, float("inf"))
|
||||
else "infinite (first dream)"
|
||||
)
|
||||
prompt = (
|
||||
f"You generate retrieval queries for an Active Inference dreamer. The "
|
||||
f"dreamer surfaces prediction errors — gaps between Aaron's model and "
|
||||
f"reality — not summaries or generic associations.\n\n"
|
||||
f"MODE: {mode}\n"
|
||||
f"FRAMING: {framing}\n\n"
|
||||
f"OBSERVATION SIGNAL:\n"
|
||||
f"- Days since last dream: {days_str}\n"
|
||||
f"- New chunks since last dream: {signal.get('new_chunks', 0)}\n"
|
||||
f"- New journal entries: {journal_snippet}\n"
|
||||
f"- Underprocessed chunks pool: {signal.get('underprocessed_count', 0):,}\n\n"
|
||||
f"RECENT USER QUESTIONS (last 14 days, top 8):\n{questions_snippet}\n\n"
|
||||
f"Generate {n_queries} retrieval queries. Requirements:\n"
|
||||
f"- Use concrete entities, named projects, course codes, specific topics "
|
||||
f"— NOT generic phrasing like 'research work practice'\n"
|
||||
f"- Each query probes a DIFFERENT corner of recent activity\n"
|
||||
f"- Match the {mode} framing\n"
|
||||
f"- 5-15 words each\n\n"
|
||||
f'Output JSON ONLY: {{"queries": ["...", "...", ...]}}'
|
||||
)
|
||||
|
||||
embedding = embedder.encode([query]).tolist()[0]
|
||||
chunks = []
|
||||
seen_sources = set()
|
||||
try:
|
||||
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
|
||||
resp = client.messages.create(
|
||||
model=LLM_QUERY_MODEL,
|
||||
max_tokens=512,
|
||||
messages=[{"role": "user", "content": prompt}],
|
||||
)
|
||||
text = "".join(b.text for b in resp.content if hasattr(b, "text")).strip()
|
||||
if text.startswith("```"):
|
||||
text = text.split("```", 2)[1]
|
||||
if text.startswith("json"):
|
||||
text = text[4:]
|
||||
text = text.strip()
|
||||
data = json.loads(text)
|
||||
queries = data.get("queries", [])
|
||||
if isinstance(queries, list) and queries:
|
||||
return [str(q).strip() for q in queries[:n_queries] if str(q).strip()]
|
||||
except Exception as e:
|
||||
print(f"[dream] LLM query generation failed ({e}); falling back to recent questions")
|
||||
|
||||
fallback = signal.get("recent_questions", [])[:n_queries] if signal else []
|
||||
return fallback or [task or "recent activity decisions thinking"]
|
||||
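Review note: the response-parsing step in `_llm_generate_queries` (strip an optional Markdown fence, parse JSON, keep non-empty strings) can be isolated as a pure function and tested without an API call. The sketch below is only an illustration under assumptions: `parse_queries` and `FENCE` are hypothetical names that do not appear in the diff, and unlike the code above it filters blank entries before truncating to `n_queries`.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this block tidy


def parse_queries(text, n_queries):
    """Return up to n_queries non-empty query strings, or [] if parsing fails."""
    text = text.strip()
    if text.startswith(FENCE):
        # Keep only what sits between the first pair of fences.
        text = text.split(FENCE, 2)[1]
        if text.startswith("json"):
            text = text[4:]  # drop the language tag after the opening fence
        text = text.strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    queries = data.get("queries", []) if isinstance(data, dict) else []
    if not isinstance(queries, list):
        return []
    cleaned = [str(q).strip() for q in queries if str(q).strip()]
    return cleaned[:n_queries]


fenced = FENCE + 'json\n{"queries": ["a b", " ", "c d", "e f"]}\n' + FENCE
result = parse_queries(fenced, 2)
```

Returning an empty list on any malformed reply lets the caller fall through to the recent-questions fallback without a broad `except`.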
|
||||
|
||||
def _mmr_select(candidate_embeddings, query_embedding, n, lambda_=MMR_LAMBDA):
    """Maximal Marginal Relevance: greedy selection that balances relevance
    against pairwise diversity. Carbonell & Goldstein 1998. Used to prevent
    cluster lock-in (e.g., 8 dossier-narrative variants filling all 8 slots).

    candidate_embeddings: (N, D) numpy array
    query_embedding: (D,) numpy array
    Returns: list of indices into candidate_embeddings, len ≤ n."""
    if len(candidate_embeddings) == 0:
        return []
    n = min(n, len(candidate_embeddings))
    # Normalize rows so dot products are cosine similarities.
    cands = candidate_embeddings / (np.linalg.norm(candidate_embeddings, axis=1, keepdims=True) + 1e-9)
    q = query_embedding / (np.linalg.norm(query_embedding) + 1e-9)
    relevance = cands @ q
    selected = []
    remaining = list(range(len(cands)))
    while len(selected) < n and remaining:
        if not selected:
            # First pick: pure relevance, nothing to be redundant with yet.
            best = max(remaining, key=lambda i: relevance[i])
        else:
            sel = cands[selected]
            # Penalize each candidate by its closest already-selected neighbor.
            scores = {
                i: lambda_ * relevance[i] - (1 - lambda_) * float((cands[i] @ sel.T).max())
                for i in remaining
            }
            best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
|
||||
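Review note: a small worked example makes the relevance/diversity trade-off in `_mmr_select` concrete. This is a dependency-free sketch, not the function from the diff: `mmr_select`, `_unit`, and `_dot` are hypothetical names, and the default `lam=0.5` is an assumption because the value of `MMR_LAMBDA` is not visible here.

```python
import math


def _unit(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mmr_select(candidates, query, n, lam=0.5):
    """Greedy MMR: returns indices into candidates, at most n of them."""
    cands = [_unit(c) for c in candidates]
    q = _unit(query)
    relevance = [_dot(c, q) for c in cands]
    selected, remaining = [], list(range(len(cands)))
    while remaining and len(selected) < n:
        def score(i):
            if not selected:
                return relevance[i]
            # Redundancy is the similarity to the closest already-picked item.
            redundancy = max(_dot(cands[i], cands[j]) for j in selected)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Rows 0 and 1 are near-duplicates; row 2 points in a different direction.
cands = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
query = [0.9, 0.3]
diverse = mmr_select(cands, query, n=2, lam=0.5)
relevance_only = mmr_select(cands, query, n=2, lam=1.0)
```

With `lam=1.0` the second slot goes to the near-duplicate of the first pick; with `lam=0.5` the redundancy penalty hands it to the distinct vector instead, which is the cluster lock-in the docstring describes.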
|
||||
|
||||
def _bump_consolidation_cursor(chunks):
|
||||
"""Increment consolidation_count + set last_consolidated_at=NOW() for each
|
||||
source represented in chunks. Called from dream_pipeline after NREM
|
||||
completes. Per sharp-wave-ripples biology, NREM does the actual
|
||||
consolidation; REM is associative use, so we only bump on NREM."""
|
||||
if not chunks:
|
||||
return
|
||||
sources = list({c["source"] for c in chunks if c.get("source")})
|
||||
if not sources:
|
||||
return
|
||||
try:
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
excluded_sources = excluded_sources or set()
|
||||
if excluded_sources:
|
||||
cur.execute("""
|
||||
SELECT document, source, 1 - (embedding <=> %s::vector) as similarity
|
||||
FROM embeddings
|
||||
WHERE source NOT IN %s
|
||||
ORDER BY embedding <=> %s::vector
|
||||
LIMIT %s
|
||||
""", (embedding, tuple(excluded_sources), embedding, n_results * 3))
|
||||
else:
|
||||
cur.execute("""
|
||||
SELECT document, source, 1 - (embedding <=> %s::vector) as similarity
|
||||
FROM embeddings
|
||||
ORDER BY embedding <=> %s::vector
|
||||
LIMIT %s
|
||||
""", (embedding, embedding, n_results * 3))
|
||||
|
||||
for doc, source, similarity in cur.fetchall():
|
||||
if not (low <= similarity <= high):
|
||||
continue
|
||||
if source in seen_sources:
|
||||
continue
|
||||
chunks.append({
|
||||
"source": source or "unknown",
|
||||
"content": doc,
|
||||
"relevance": similarity,
|
||||
"similarity": similarity,
|
||||
})
|
||||
seen_sources.add(source)
|
||||
if len(chunks) >= n_results:
|
||||
break
|
||||
cur.execute(
|
||||
"UPDATE embeddings "
|
||||
"SET consolidation_count = consolidation_count + 1, "
|
||||
" last_consolidated_at = NOW() "
|
||||
"WHERE source = ANY(%s)",
|
||||
(sources,),
|
||||
)
|
||||
pg.commit()
|
||||
pg.close()
|
||||
except Exception as e:
|
||||
print(f"pgvector retrieval error: {e}")
|
||||
print(f"[dream] cursor bump failed (non-fatal): {e}")
|
||||
|
||||
|
||||
def retrieve(mode, task=None, n_results=8, excluded_sources=None,
|
||||
type_filter=None, signal=None):
|
||||
"""Refactored retrieval — see dreamer-design-spec.md Stage 3 + the
|
||||
external-literature prescription in birdai-dreamer-exclusion-finding-2026-05-02.md.
|
||||
|
||||
Changes from the prior hardcoded-query version:
|
||||
- Queries are LLM-generated from the observation signal (Park et al.
|
||||
reflection pattern) instead of fixed strings. Solves the "same 8 sources
|
||||
every night" failure where fixed seeds locked into one neighborhood.
|
||||
- Per-mode time windows (24-72hr NREM / 30d Early REM / 90d Late REM)
|
||||
filter candidates before vector search. Spec calls for these to be
|
||||
mutable; they live in TIME_WINDOWS_HOURS.
|
||||
- NREM biases toward under-processed chunks (low consolidation_count).
|
||||
Biologically motivated: sharp-wave ripples tag what to replay, not
|
||||
uniform sampling.
|
||||
- Multiple queries (4 by default) → over-fetch → MMR merge for
|
||||
within-night diversity. Prevents cluster domination.
|
||||
|
||||
signal is the observation-signal dict from dream_observation.observe_corpus().
|
||||
If None, observe_corpus is called inline (back-compat for ad-hoc invocation).
|
||||
"""
|
||||
# E3 substrate experiment unchanged
|
||||
substrate = os.getenv("DREAMER_SUBSTRATE", "pgvector")
|
||||
if substrate == "graphiti":
|
||||
return retrieve_graphiti(mode, task=task, n_results=n_results,
|
||||
excluded_sources=excluded_sources)
|
||||
|
||||
if signal is None:
|
||||
from dream_observation import observe_corpus as _obs
|
||||
signal = _obs()
|
||||
|
||||
queries = _llm_generate_queries(mode, signal, task=task, n_queries=4)
|
||||
if not queries:
|
||||
print(f"[dream:{mode}] no queries generated; bailing")
|
||||
return []
|
||||
print(f"[dream:{mode}] generated queries: {queries}")
|
||||
|
||||
embedder = _get_embedder()
|
||||
excluded_sources = excluded_sources or set()
|
||||
window_hours = TIME_WINDOWS_HOURS.get(mode)
|
||||
per_query_n = 12 # over-fetch for MMR
|
||||
|
||||
candidates = []
|
||||
seen_ids = set()
|
||||
try:
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
for q in queries:
|
||||
q_emb = embedder.encode([q]).tolist()[0]
|
||||
where, params = [], []
|
||||
if excluded_sources:
|
||||
where.append("source NOT IN %s")
|
||||
params.append(tuple(excluded_sources))
|
||||
if type_filter:
|
||||
where.append("type = ANY(%s)")
|
||||
params.append(list(type_filter))
|
||||
if window_hours is not None:
|
||||
# created_at is TEXT (legacy); cast it. NULL created_at fails
|
||||
# the comparison so legacy rows are excluded from windowed
|
||||
# modes — correct: NULL means "indexed before cursor existed,"
|
||||
# which by definition is older than any window.
|
||||
where.append(
|
||||
f"(created_at IS NOT NULL AND "
|
||||
f"created_at::timestamptz > NOW() - INTERVAL '{int(window_hours)} hours')"
|
||||
)
|
||||
where_clause = ("WHERE " + " AND ".join(where)) if where else ""
|
||||
# NREM bias: consolidation_count ASC is the primary sort key, so
# under-processed chunks come first and vector distance only breaks
# ties among equally consolidated chunks. Other modes: vector distance only.
|
||||
order_clause = (
|
||||
"ORDER BY consolidation_count ASC, embedding <=> %s::vector"
|
||||
if mode == "nrem"
|
||||
else "ORDER BY embedding <=> %s::vector"
|
||||
)
|
||||
cur.execute(f"""
|
||||
SELECT id, document, source, type, embedding,
|
||||
1 - (embedding <=> %s::vector) as similarity
|
||||
FROM embeddings
|
||||
{where_clause}
|
||||
{order_clause}
|
||||
LIMIT %s
|
||||
""", [q_emb, *params, q_emb, per_query_n])
|
||||
for row in cur.fetchall():
|
||||
if row[0] in seen_ids:
|
||||
continue
|
||||
seen_ids.add(row[0])
|
||||
emb = row[4]
|
||||
# pgvector returns embeddings as string "[...]" by default
|
||||
if isinstance(emb, str):
|
||||
emb = np.array([float(x) for x in emb.strip("[]").split(",")])
|
||||
else:
|
||||
emb = np.array(emb)
|
||||
candidates.append({
|
||||
"id": row[0],
|
||||
"content": row[1],
|
||||
"source": row[2] or "unknown",
|
||||
"type": row[3],
|
||||
"embedding": emb,
|
||||
"similarity": float(row[5]),
|
||||
})
|
||||
pg.close()
|
||||
except Exception as e:
|
||||
import traceback
|
||||
print(f"[dream:{mode}] retrieval SQL error: {e}")
|
||||
traceback.print_exc()
|
||||
return []
|
||||
|
||||
if not candidates:
|
||||
print(f"[dream:{mode}] zero candidates after filters")
|
||||
return []
|
||||
|
||||
# MMR over the union, using the first query as pivot for the relevance term.
|
||||
# Averaging query embeddings would be theoretically cleaner but adds
|
||||
# complexity for marginal benefit at this scale.
|
||||
pivot_emb = np.array(embedder.encode([queries[0]]).tolist()[0])
|
||||
cand_embs = np.array([c["embedding"] for c in candidates])
|
||||
selected_idx = _mmr_select(cand_embs, pivot_emb, n=n_results * 2)
|
||||
|
||||
# Post-MMR source-level dedup (multi-chunk same source collapses to one).
|
||||
chunks = []
|
||||
seen_sources = set()
|
||||
for i in selected_idx:
|
||||
c = candidates[i]
|
||||
if c["source"] in seen_sources:
|
||||
continue
|
||||
seen_sources.add(c["source"])
|
||||
chunks.append({
|
||||
"source": c["source"],
|
||||
"content": c["content"],
|
||||
"relevance": c["similarity"],
|
||||
"similarity": c["similarity"],
|
||||
"type": c["type"],
|
||||
})
|
||||
if len(chunks) >= n_results:
|
||||
break
|
||||
|
||||
return chunks
|
||||
|
||||
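Review note: the trickiest part of the new `retrieve` is keeping the `%s` placeholders and the parameter list in the same order (SELECT embedding, then the WHERE filters, then the ORDER BY embedding, then LIMIT). The sketch below pulls that assembly into a pure function so the ordering can be checked without a database. `build_retrieval_sql` is a hypothetical helper name, not something in the diff; the clauses mirror the ones shown above.

```python
def build_retrieval_sql(q_emb, mode, excluded_sources=None, type_filter=None,
                        window_hours=None, limit=12):
    """Return (sql, params) with placeholders and params in matching order."""
    where, params = [], []
    if excluded_sources:
        where.append("source NOT IN %s")
        params.append(tuple(excluded_sources))
    if type_filter:
        where.append("type = ANY(%s)")
        params.append(list(type_filter))
    if window_hours is not None:
        # int() keeps the interpolated interval safe; it adds no placeholder.
        where.append(
            "(created_at IS NOT NULL AND created_at::timestamptz > "
            f"NOW() - INTERVAL '{int(window_hours)} hours')"
        )
    where_clause = ("WHERE " + " AND ".join(where)) if where else ""
    order_clause = (
        "ORDER BY consolidation_count ASC, embedding <=> %s::vector"
        if mode == "nrem"
        else "ORDER BY embedding <=> %s::vector"
    )
    sql = (
        "SELECT id, document, source, type, embedding, "
        "1 - (embedding <=> %s::vector) AS similarity "
        f"FROM embeddings {where_clause} {order_clause} LIMIT %s"
    )
    # Same order as the placeholders: SELECT, WHERE filters, ORDER BY, LIMIT.
    return sql, [q_emb, *params, q_emb, limit]


sql, params = build_retrieval_sql(
    [0.1, 0.2], "nrem",
    excluded_sources={"a.md"}, type_filter=["document"], window_hours=72,
)
```

A cheap invariant to assert in a unit test is that the number of `%s` placeholders equals `len(params)`; a mismatch is what usually surfaces at runtime as a psycopg2 argument-count error.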
@@ -222,124 +604,39 @@ def retrieve(mode, task=None, n_results=8, excluded_sources=None):
|
||||
|
||||
def synthesize_nrem(chunks):
|
||||
chunk_text = "\n\n---\n\n".join([f"[{c['source']}]\n{c['content']}" for c in chunks])
|
||||
prompt = f"""You have read everything Aaron Nelson has written and published.
|
||||
You are a careful colleague who noticed something this week.
|
||||
|
||||
Here is material from his corpus:
|
||||
|
||||
{chunk_text}
|
||||
|
||||
Write to Aaron directly. Identify one specific connection between
|
||||
this material and something he wrote or worked on previously.
|
||||
Stay close to the documents — cite them specifically by name.
|
||||
Do not speculate beyond what the material supports. Do not use
|
||||
headers or bullet points. Write one paragraph of 200-300 words
|
||||
that ends with a single concrete question he could act on."""
|
||||
return _call_claude(prompt)
|
||||
return _call_claude(NREM_PROMPT_TEMPLATE.format(chunk_text=chunk_text))
|
||||
|
||||
|
||||
def synthesize_early_rem(chunks, nrem_output):
|
||||
# v1.1 — removed citation instruction, removed close-friend persona,
|
||||
# shifted register from analysis to recognition.
|
||||
chunk_text = "\n\n---\n\n".join([f"[{c['source']}]\n{c['content']}" for c in chunks])
|
||||
prompt = f"""Something was noticed earlier tonight, moving through Aaron's recent work:
|
||||
|
||||
{nrem_output}
|
||||
|
||||
That observation is still with you. Now here is material from a different
|
||||
time — pulled from further back, from different parts of his corpus:
|
||||
|
||||
{chunk_text}
|
||||
|
||||
You are not analyzing. You are recognizing.
|
||||
|
||||
Something in the earlier observation and something in this older material
|
||||
are the same thing wearing different clothes. Find it. Don't explain why
|
||||
they're connected — just let the connection speak. Write from inside the
|
||||
recognition, not from above it.
|
||||
|
||||
The emotional register underneath the career logic is more interesting
|
||||
than the career logic. The pattern that has been repeating longer than
|
||||
he has been aware of it is more interesting than the current instance.
|
||||
|
||||
Write directly to Aaron. No citations, no references, no analysis.
|
||||
First person, present tense. Let what you noticed arrive rather than
|
||||
be delivered. 150-250 words. End with one thing that is true that
|
||||
he probably already knows but hasn't said out loud yet."""
|
||||
return _call_claude(prompt)
|
||||
return _call_claude(EARLY_REM_PROMPT_TEMPLATE.format(
|
||||
nrem_output=nrem_output, chunk_text=chunk_text))
|
||||
|
||||
|
||||
def synthesize_late_rem(chunks, nrem_output, early_rem_output):
|
||||
chunk_text = "\n\n---\n\n".join([f"[{c['source']}]\n{c['content']}" for c in chunks])
|
||||
prompt = f"""You have been moving through Aaron Nelson's corpus all night.
|
||||
First you found this, in the careful light of early consolidation:
|
||||
|
||||
{nrem_output}
|
||||
|
||||
Then, in the more personal territory that followed:
|
||||
|
||||
{early_rem_output}
|
||||
|
||||
Now it is late. The boundaries between things have loosened.
|
||||
Here is material pulled from opposite ends of his work:
|
||||
|
||||
{chunk_text}
|
||||
|
||||
Do not explain the connections between all of this.
|
||||
Do not resolve them. Do not summarize what came before.
|
||||
Something stranger is possible now — let the accumulated
|
||||
material from the night find its own shape. Compressed,
|
||||
associative, slightly off. Let the strangeness stand.
|
||||
|
||||
No headers. No bullet points. No hedging. No resolution.
|
||||
No offer. End mid-thought if that is where the material ends.
|
||||
150-250 words."""
|
||||
return _call_claude(prompt)
|
||||
return _call_claude(LATE_REM_PROMPT_TEMPLATE.format(
|
||||
nrem_output=nrem_output,
|
||||
early_rem_output=early_rem_output,
|
||||
chunk_text=chunk_text))
|
||||
|
||||
|
||||
def synthesize_final(nrem_output, early_rem_output, late_rem_output):
|
||||
prompt = f"""You have spent the night moving through Aaron Nelson's corpus
|
||||
in three passes, each building on the last.
|
||||
|
||||
The first pass — careful, close to the documents:
|
||||
{nrem_output}
|
||||
|
||||
The second pass — more personal, following what the first opened:
|
||||
{early_rem_output}
|
||||
|
||||
The third pass — associative, strange, letting things touch that
|
||||
don't normally touch:
|
||||
{late_rem_output}
|
||||
|
||||
Now synthesize. Not a summary — a synthesis. Find what runs through
|
||||
all three that none of them said directly. The thing that only becomes
|
||||
visible when you hold all three passes together.
|
||||
|
||||
Write it as a single unbroken piece. No headers, no bullet points,
|
||||
no stage labels. 200-300 words. End with the one question that
|
||||
matters most right now."""
|
||||
return _call_claude(prompt, max_tokens=800)
|
||||
return _call_claude(
|
||||
SYNTHESIS_PROMPT_TEMPLATE.format(
|
||||
nrem_output=nrem_output,
|
||||
early_rem_output=early_rem_output,
|
||||
late_rem_output=late_rem_output),
|
||||
max_tokens=800)
|
||||
|
||||
|
||||
def synthesize_lucid(chunks, task):
|
||||
chunk_text = "\n\n---\n\n".join([f"[{c['source']}]\n{c['content']}" for c in chunks])
|
||||
prompt = f"""Aaron has a question he is sitting with:
|
||||
|
||||
{task or "What should I be thinking about that I am not?"}
|
||||
|
||||
You have searched his entire corpus and found material that
|
||||
speaks to this question from unexpected directions. Here is
|
||||
what you found:
|
||||
|
||||
{chunk_text}
|
||||
|
||||
Do not summarize. Do not list. Pick the most interesting
|
||||
tension between what the corpus contains and what he is
|
||||
asking, and follow it through to its conclusion. Cite
|
||||
specific documents by name. Be direct about what you think.
|
||||
No headers, no bullet points. 250-400 words.
|
||||
End with an offer to work on it together."""
|
||||
return _call_claude(prompt)
|
||||
resolved_task = task or LUCID_DEFAULT_TASK
|
||||
return _call_claude(LUCID_PROMPT_TEMPLATE.format(
|
||||
task=resolved_task, chunk_text=chunk_text))
|
||||
|
||||
|
||||
def _call_claude(prompt, max_tokens=1000):
|
||||
@@ -418,10 +715,10 @@ def write_manifest(date_str, stage_data, corpus_data):
|
||||
"prompt_sig": prompt_signature(),
|
||||
"dreamer_version": DREAMER_VERSION,
|
||||
"prompt_hash": prompt_hash([
|
||||
synthesize_nrem.__doc__ or "",
|
||||
synthesize_early_rem.__doc__ or "",
|
||||
synthesize_late_rem.__doc__ or "",
|
||||
synthesize_final.__doc__ or "",
|
||||
NREM_PROMPT_TEMPLATE,
|
||||
EARLY_REM_PROMPT_TEMPLATE,
|
||||
LATE_REM_PROMPT_TEMPLATE,
|
||||
SYNTHESIS_PROMPT_TEMPLATE,
|
||||
]),
|
||||
"stages": stage_data,
|
||||
"corpus": corpus_data,
|
||||
@@ -432,36 +729,71 @@ def write_manifest(date_str, stage_data, corpus_data):
|
||||
auth = (NEXTCLOUD_USER, NEXTCLOUD_PASSWORD)
|
||||
url = f"{DREAMS_WEBDAV}/dream-manifest-{date_str}.json"
|
||||
try:
|
||||
requests.put(url, data=content.encode("utf-8"), auth=auth, timeout=30)
|
||||
response = requests.put(url, data=content.encode("utf-8"), auth=auth, timeout=30)
|
||||
response.raise_for_status()
|
||||
print(f"Manifest written: Journal/Dreams/dream-manifest-{date_str}.json")
|
||||
except Exception as e:
|
||||
print(f"Manifest write failed (non-critical): {e}")
|
||||
print(f"Manifest write failed — manifest not persisted: {e}")
|
||||
|
||||
|
||||
def dream_pipeline():
|
||||
def dream_pipeline(type_filter=None):
|
||||
"""
|
||||
Full nightly pipeline — interdependent stages.
|
||||
NREM output feeds Early REM. Both feed Late REM. All three feed Synthesis.
|
||||
|
||||
Per dreamer-design-spec.md, this now runs Stage 1 (observe) and Stage 2
|
||||
(select) first. If select_mode returns None — corpus unchanged and no new
|
||||
journal entry — the dreamer goes quiet rather than manufacturing novelty.
|
||||
Otherwise NREM/Early-REM/Late-REM run with LLM-generated queries seeded
|
||||
from the observation signal.
|
||||
"""
|
||||
print(f"Dreamer pipeline starting — {datetime.now().strftime('%Y-%m-%d %H:%M')}")
|
||||
|
||||
state = load_dreamer_state()
|
||||
previously_retrieved = set(state.get("retrieved_sources", []))
|
||||
state.pop("retrieved_sources", None) # legacy key; session-scoped novelty now
|
||||
session_retrieved = set()
|
||||
|
||||
delta = observe_corpus()
|
||||
print(f"Corpus: {delta['new_chunks']} new chunks, {delta['days_since_dream']:.1f} days since last dream")
|
||||
print(f"Excluding {len(previously_retrieved)} previously retrieved sources")
|
||||
# ── Stage 1 + 2: Observe + Select ──────────────────────────────────────
|
||||
from dream_observation import observe_corpus as _obs, select_mode as _select
|
||||
signal = _obs()
|
||||
print(
|
||||
f"Signal: new_chunks={signal['new_chunks']}, "
|
||||
f"new_journal={len(signal['new_journal_entries'])}, "
|
||||
f"days_since={signal['days_since_dream']:.1f}, "
|
||||
f"underprocessed={signal['underprocessed_count']:,}"
|
||||
)
|
||||
selected = _select(signal)
|
||||
if selected is None:
|
||||
print("[select_mode] None — nothing worth dreaming about tonight (going quiet)")
|
||||
# Record last_select_quiet_at but leave last_dream_timestamp and
# last_dream_file untouched, so a caller can tell a skipped night from an
# actual dream by checking last_dream_file or the manifest directory.
|
||||
state["last_select_quiet_at"] = datetime.now().isoformat()
|
||||
save_dreamer_state(state)
|
||||
return None
|
||||
print(f"[select_mode] → {selected}")
|
||||
|
||||
# ── Stage 1: NREM ──────────────────────────────────────────────────────
|
||||
# The pipeline always runs all three modes for the manifest's continuity.
|
||||
# select_mode's choice signals the *primary* focus; the others still run
|
||||
# but draw from their own mode-appropriate windows.
|
||||
primary_mode = selected
|
||||
|
||||
# ── Stage 3: NREM ──────────────────────────────────────────────────────
|
||||
print("\n[NREM] Retrieving...")
|
||||
nrem_chunks = retrieve("nrem", excluded_sources=previously_retrieved | session_retrieved)
|
||||
# NREM is replay-and-consolidation — does not exclude prior traces.
|
||||
# Late REM and Early REM exclude prior content for novelty; NREM does not.
|
||||
nrem_chunks = retrieve("nrem", excluded_sources=None,
|
||||
type_filter=type_filter, signal=signal)
|
||||
session_retrieved.update(c["source"] for c in nrem_chunks)
|
||||
# Track sources that scored above Early REM ceiling — these are the only ones Early REM should exclude
|
||||
nrem_high_sources = {c["source"] for c in nrem_chunks if c["similarity"] > 0.55}
|
||||
if not nrem_chunks:
|
||||
print("[NREM] No suitable chunks — aborting pipeline")
|
||||
return None
|
||||
# Cursor bump: NREM is the consolidation stage. Each appearance increments
|
||||
# consolidation_count + updates last_consolidated_at, so the next dream's
|
||||
# observation sees these sources as less under-processed.
|
||||
_bump_consolidation_cursor(nrem_chunks)
|
||||
|
||||
print(f"[NREM] Retrieved {len(nrem_chunks)} chunks. Synthesizing...")
|
||||
nrem_output = synthesize_nrem(nrem_chunks)
|
||||
@@ -472,11 +804,15 @@ def dream_pipeline():
|
||||
"nrem": {
|
||||
"chunks_retrieved": len(nrem_chunks),
|
||||
"avg_similarity": round(sum(c["relevance"] for c in nrem_chunks) / len(nrem_chunks), 3),
|
||||
"query": "research fabrication teaching practice recent work",
|
||||
"query": "[llm-generated from observation signal]",
|
||||
"word_count": len(nrem_output.split()),
|
||||
"sources": nrem_sources,
|
||||
"distinct_folders": nrem_folders,
|
||||
"folder_count": len(nrem_folders),
|
||||
# Counter filters None: Graphiti chunks lack `type` (facts, not embeddings rows).
|
||||
# Pgvector chunks always carry type post-Improvement-#2 backfill. If type
|
||||
# ever appears as None here, the backfill or writer enforcement has regressed.
|
||||
"type_distribution": dict(Counter(c.get("type") for c in nrem_chunks if c.get("type"))),
|
||||
"status": "ok",
|
||||
}
|
||||
}
|
||||
@@ -486,7 +822,8 @@ def dream_pipeline():
|
||||
print("\n[Early REM] Retrieving...")
|
||||
# Early REM excludes previously retrieved + NREM high-scorers only (not full session_retrieved)
|
||||
# Sources that scored in Early REM band during NREM remain available
|
||||
early_chunks = retrieve("early-rem", excluded_sources=previously_retrieved | nrem_high_sources)
|
||||
early_chunks = retrieve("early-rem", excluded_sources=nrem_high_sources,
|
||||
type_filter=type_filter, signal=signal)
|
||||
session_retrieved.update(c["source"] for c in early_chunks)
|
||||
if not early_chunks:
|
||||
print("[Early REM] No suitable chunks — skipping")
|
||||
@@ -500,18 +837,20 @@ def dream_pipeline():
|
||||
stage_data["early_rem"] = {
|
||||
"chunks_retrieved": len(early_chunks),
|
||||
"avg_similarity": round(sum(c["relevance"] for c in early_chunks) / len(early_chunks), 3),
|
||||
"query": "career decision personal change what matters next",
|
||||
"query": "[llm-generated from observation signal]",
|
||||
"word_count": len(early_rem_output.split()),
|
||||
"sources": early_sources,
|
||||
"distinct_folders": early_folders,
|
||||
"folder_count": len(early_folders),
|
||||
"type_distribution": dict(Counter(c.get("type") for c in early_chunks if c.get("type"))),
|
||||
"status": "ok",
|
||||
}
|
||||
print(f"[Early REM] Done.\n{early_rem_output[:200]}...")
|
||||
|
||||
# ── Stage 3: Late REM — informed by NREM + Early REM ──────────────────
|
||||
print("\n[Late REM] Retrieving...")
|
||||
late_chunks = retrieve("late-rem", excluded_sources=previously_retrieved | session_retrieved)
|
||||
late_chunks = retrieve("late-rem", excluded_sources=session_retrieved,
|
||||
type_filter=type_filter, signal=signal)
|
||||
session_retrieved.update(c["source"] for c in late_chunks)
|
||||
if not late_chunks:
|
||||
print("[Late REM] No suitable chunks — skipping")
|
||||
@@ -530,12 +869,13 @@ def dream_pipeline():
|
||||
stage_data["late_rem"] = {
|
||||
"chunks_retrieved": len(late_chunks),
|
||||
"avg_similarity": round(sum(c["relevance"] for c in late_chunks) / len(late_chunks), 3),
|
||||
"query": "practice place memory making",
|
||||
"query": "[llm-generated from observation signal]",
|
||||
"word_count": len(late_rem_output.split()),
|
||||
"sources": late_sources,
|
||||
"distinct_folders": list(set(late_folders)),
|
||||
"folder_count": len(set(late_folders)),
|
||||
"cross_domain_pairs": cross_domain_pairs,
|
||||
"type_distribution": dict(Counter(c.get("type") for c in late_chunks if c.get("type"))),
|
||||
"status": "ok",
|
||||
}
|
||||
print(f"[Late REM] Done.\n{late_rem_output[:200]}...")
|
||||
@@ -557,8 +897,20 @@ def dream_pipeline():
|
||||
# Write manifest
|
||||
all_session_sources = list(session_retrieved)
|
||||
all_session_folders = list({extract_folder(s) for s in all_session_sources})
|
||||
total_chunks = 0
|
||||
pg = None
|
||||
try:
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
cur.execute("SELECT COUNT(*) FROM embeddings")
|
||||
total_chunks = cur.fetchone()[0]
|
||||
except Exception as e:
|
||||
print(f"total_chunks query failed (non-critical): {e}")
|
||||
finally:
|
||||
if pg is not None:
|
||||
pg.close()
|
||||
corpus_data = {
|
||||
"total_chunks": delta.get("new_chunks", 0),
|
||||
"total_chunks": total_chunks,
|
||||
"new_chunks_since_last_dream": delta.get("new_chunks", 0),
|
||||
"days_since_last_dream": round(delta.get("days_since_dream", 0), 2),
|
||||
"substrate": "pgvector",
|
||||
@@ -570,18 +922,11 @@ def dream_pipeline():
|
||||
}
|
||||
write_manifest(datetime.now().strftime("%Y-%m-%d"), stage_data, corpus_data)
|
||||
|
||||
# Update state and notify
|
||||
state = load_dreamer_state()
|
||||
# Update state and notify (reuse state from start of pipeline; legacy key already popped)
|
||||
state["last_dream_timestamp"] = datetime.now().timestamp()
|
||||
state["last_dream_mode"] = "pipeline"
|
||||
state["last_dream_file"] = synthesis_file
|
||||
|
||||
# Accumulate retrieved sources across nights. Cap at 500, trim to 400 on overflow.
|
||||
all_retrieved = list(previously_retrieved | session_retrieved)
|
||||
if len(all_retrieved) > 500:
|
||||
all_retrieved = all_retrieved[-400:]
|
||||
state["retrieved_sources"] = all_retrieved
|
||||
|
||||
save_dreamer_state(state)
|
||||
|
||||
notify_sse("synthesis", synthesis_file.split("/")[-1])
|
||||
@@ -589,10 +934,10 @@ def dream_pipeline():
|
||||
return synthesis_file
|
||||
|
||||
|
||||
def dream_lucid(task):
|
||||
def dream_lucid(task, type_filter=None):
|
||||
"""On-demand lucid dream — single mode, used by Dream Now in settings."""
|
||||
print(f"Lucid dream starting — task: {task[:80] if task else 'none'}")
|
||||
chunks = retrieve("lucid", task=task)
|
||||
chunks = retrieve("lucid", task=task, type_filter=type_filter)
|
||||
if not chunks:
|
||||
print("No suitable chunks — aborting")
|
||||
return None
|
||||
@@ -614,13 +959,13 @@ def dream_lucid(task):
|
||||
return filepath
|
||||
|
||||
|
||||
def dream_single(mode, task=None):
|
||||
def dream_single(mode, task=None, type_filter=None):
|
||||
"""
|
||||
Single mode — used by Dream Now for non-lucid modes.
|
||||
Runs one stage independently (for testing/tuning individual stages).
|
||||
"""
|
||||
print(f"Single mode dream: {mode}")
|
||||
chunks = retrieve(mode, task=task)
|
||||
chunks = retrieve(mode, task=task, type_filter=type_filter)
|
||||
if not chunks:
|
||||
print("No suitable chunks — aborting")
|
||||
return None
|
||||
@@ -657,12 +1002,19 @@ if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(description="Aaron AI Dreamer")
|
||||
parser.add_argument("--mode", choices=["nrem", "early-rem", "late-rem", "lucid", "pipeline"])
|
||||
parser.add_argument("--task", type=str)
|
||||
parser.add_argument(
|
||||
"--type-filter", type=str, default=None,
|
||||
help="Comma-separated embeddings.type allowlist (e.g. 'document,aaronai_conversation'). "
|
||||
"Applies to pgvector retrieval only; Graphiti chunks are not filtered. "
|
||||
"Experimental — default is no filter, no behavior change.",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
type_filter = [t.strip() for t in args.type_filter.split(",")] if args.type_filter else None
|
||||
|
||||
if args.mode == "lucid":
|
||||
dream_lucid(args.task or "What should I be thinking about that I am not?")
|
||||
dream_lucid(args.task or "What should I be thinking about that I am not?", type_filter=type_filter)
|
||||
elif args.mode and args.mode != "pipeline":
|
||||
dream_single(args.mode, args.task)
|
||||
dream_single(args.mode, args.task, type_filter=type_filter)
|
||||
else:
|
||||
# Default: full pipeline
|
||||
dream_pipeline()
|
||||
dream_pipeline(type_filter=type_filter)
|
||||
|
||||
@@ -0,0 +1,235 @@
|
||||
"""
|
||||
Dreamer Stages 1 + 2: Observe and Select.

Implements `dreamer-design-spec.md`'s Stage 1 (observe_corpus) and Stage 2
(select_mode). These have been latent in dream.py: observe_corpus existed
in skeletal form but its output was largely unused; select_mode did not
exist at all. The dreamer always ran all stages with hardcoded queries.

Per spec (lines 27–34 of dreamer-design-spec.md):
    delta = observe_corpus()
    selected_mode = select_mode(delta, task, project)
    if selected_mode is None:
        return  # nothing worth dreaming

The "returns None; dreamer goes quiet rather than manufacturing novelty"
semantics (spec line 67) is the canonical answer to the repetition problem
documented in birdai-dreamer-exclusion-finding-2026-05-02.md.

Grounded in:
- Active Inference (Friston 2010, 2017): observe error, choose action that
  minimizes free energy. The dreamer is a prediction-error machine; observe
  what's diverged from the model, dream about that.
- Sleep stages (Stickgold 2005; Walker 2017; Diekelmann & Born 2010): NREM
  for replay of new traces, REM for associative cross-cluster integration.
- Sharp-wave ripples (Buzsáki, Wilson): biology tags WHAT to replay
  (under-processed chunks); not uniform. Implemented via the consolidation
  cursor on the embeddings table.
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import sqlite3
|
||||
from datetime import datetime, timedelta
|
||||
from pathlib import Path
|
||||
|
||||
from dotenv import load_dotenv
|
||||
import psycopg2
|
||||
|
||||
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
|
||||
|
||||
# ─── Paths ──────────────────────────────────────────────────────────────────
|
||||
|
||||
PG_DSN = os.getenv("PG_DSN")
|
||||
CONVERSATIONS_DB = str(Path.home() / "aaronai" / "conversations.db")
|
||||
WATCHER_STATE = str(Path.home() / "aaronai" / "watcher_state.json")
|
||||
DREAMER_STATE = str(Path.home() / "aaronai" / "dreamer_state.json")
|
||||
JOURNAL_DAILY = "/home/aaron/nextcloud/data/data/aaron/files/Journal/Daily"
|
||||
|
||||
# ─── Thresholds ─────────────────────────────────────────────────────────────
|
||||
# Per spec, these become settings-panel controls eventually. For now they're
|
||||
# constants here; moving them to a config module is task #48.
|
||||
|
||||
NEW_CHUNK_THRESHOLD = 5 # below this, NREM not warranted on novelty alone
|
||||
STALENESS_TRIGGER_DAYS = 3 # corpus quiet ≥3 days → Late REM ("shake things loose")
|
||||
QUESTION_LOOKBACK_DAYS = 14 # spec line 61: "the last 14 days"
|
||||
UNDERPROCESSED_PERCENTILE = 0.25 # bottom quartile of consolidation_count
|
||||
|
||||
|
||||
# ─── Helpers ────────────────────────────────────────────────────────────────
|
||||
|
||||
def _get_pg():
|
||||
return psycopg2.connect(PG_DSN)
|
||||
|
||||
|
||||
def _load_json(path, default):
|
||||
try:
|
||||
return json.loads(Path(path).read_text())
|
||||
except Exception:
|
||||
return default
|
||||
|
||||
|
||||
def _recent_user_questions(days=QUESTION_LOOKBACK_DAYS, limit=20):
|
||||
"""Pull recent user-turn content from conversations.db. The spec calls
|
||||
these 'live questions' — what Aaron has been asking about. They become
|
||||
seed material for the REM modes."""
|
||||
try:
|
||||
conn = sqlite3.connect(CONVERSATIONS_DB)
|
||||
cutoff = (datetime.now() - timedelta(days=days)).isoformat()
|
||||
cur = conn.cursor()
|
||||
cur.execute(
|
||||
"""
|
||||
SELECT m.content FROM messages m
|
||||
JOIN conversations c ON m.conversation_id = c.id
|
||||
WHERE m.role = 'user' AND c.updated_at > ?
|
||||
ORDER BY m.timestamp DESC LIMIT ?
|
||||
""",
|
||||
(cutoff, limit),
|
||||
)
|
||||
rows = cur.fetchall()
|
||||
conn.close()
|
||||
return [r[0][:280] for r in rows]
|
||||
except Exception:
|
||||
return []
|
||||
|
||||
|
||||
def _new_journal_entries(since_ts):
|
||||
"""Files in Journal/Daily/ created or modified since the last dream.
|
||||
Journal entries with emotional/personal register route to Early REM per
|
||||
the spec (line 71)."""
|
||||
journal_path = Path(JOURNAL_DAILY)
|
||||
if not journal_path.exists():
|
||||
return []
|
||||
new = []
|
||||
for p in journal_path.rglob("*.md"):
|
||||
try:
|
||||
if p.stat().st_mtime > since_ts:
|
||||
new.append(str(p.relative_to(journal_path)))
|
||||
except OSError:
|
||||
continue
|
||||
return new
|
||||
|
||||
|
||||
def _new_chunks_count(since_ts):
|
||||
"""Files in the watcher state with mtime > last_dream. The spec calls
|
||||
this 'what changed' (line 58). Used as the NREM novelty signal."""
|
||||
state = _load_json(WATCHER_STATE, {})
|
||||
count = 0
|
||||
for _path, mtime in state.items():
|
||||
try:
|
||||
if float(mtime) > since_ts:
|
||||
count += 1
|
||||
except (ValueError, TypeError):
|
||||
continue
|
||||
return count
|
||||
|
||||
|
||||
def _underprocessed_chunk_count():
|
||||
"""Chunks below the underprocessed percentile by consolidation_count.
|
||||
Biologically motivated: sharp-wave ripples bias replay toward novel /
|
||||
under-encoded experience, not uniform sampling. We give NREM a pool of
|
||||
'least-replayed' chunks to draw from in Stage 3."""
|
||||
try:
|
||||
pg = _get_pg()
|
||||
cur = pg.cursor()
|
||||
cur.execute(
|
||||
"""
|
||||
WITH t AS (
|
||||
SELECT percentile_cont(%s) WITHIN GROUP (ORDER BY consolidation_count)
|
||||
AS threshold
|
||||
FROM embeddings
|
||||
)
|
||||
SELECT COUNT(*) FROM embeddings, t
|
||||
WHERE consolidation_count <= t.threshold
|
||||
""",
|
||||
(UNDERPROCESSED_PERCENTILE,),
|
||||
)
|
||||
result = cur.fetchone()[0]
|
||||
pg.close()
|
||||
return int(result or 0)
|
||||
except Exception:
|
||||
return 0
|
||||
|
||||
|
||||
# ─── Stage 1: observe_corpus ────────────────────────────────────────────────
|
||||
|
||||
def observe_corpus():
|
||||
"""Build the signal vector consumed by select_mode and (downstream) by
|
||||
retrieve. Concrete observations only — no interpretation. Each key is
|
||||
a direct measurement from the corpus, watcher, journal, or conversation
|
||||
log.
|
||||
|
||||
Returns a dict with:
|
||||
now_ts -- current Unix timestamp
|
||||
last_dream_ts -- last completed dream timestamp (0 if never)
|
||||
days_since_dream -- float; inf if never dreamed
|
||||
new_chunks -- count of files newer than last_dream
|
||||
new_journal_entries -- list of *.md paths (relative to Journal/Daily/) modified since last_dream
|
||||
recent_questions -- user-turn content from last 14 days
|
||||
underprocessed_count -- chunks in the bottom 25% by consolidation_count
|
||||
"""
|
||||
state = _load_json(DREAMER_STATE, {})
|
||||
last_dream_ts = float(state.get("last_dream_timestamp", 0) or 0)
|
||||
now_ts = datetime.now().timestamp()
|
||||
|
||||
return {
|
||||
"now_ts": now_ts,
|
||||
"last_dream_ts": last_dream_ts,
|
||||
"days_since_dream": (now_ts - last_dream_ts) / 86400 if last_dream_ts else float("inf"),
|
||||
"new_chunks": _new_chunks_count(last_dream_ts),
|
||||
"new_journal_entries": _new_journal_entries(last_dream_ts),
|
||||
"recent_questions": _recent_user_questions(),
|
||||
"underprocessed_count": _underprocessed_chunk_count(),
|
||||
}
|
||||
|
||||
|
||||
# ─── Stage 2: select_mode ───────────────────────────────────────────────────
|
||||
|
||||
def select_mode(signal, task=None, explicit_mode=None):
|
||||
"""Return one of {'nrem', 'early-rem', 'late-rem', 'lucid'}. Never None.
|
||||
|
||||
The dreamer fires every scheduled night. The earlier "go quiet on null
|
||||
delta" rule was a synthesis-doc invention that didn't match the actual
|
||||
desired UX — the original dreamer always dreamed, even if it repeated
|
||||
itself. The cure for repetition lives in the retrieve layer
|
||||
(LLM-generated queries from the observation signal, MMR diversity,
|
||||
cursor bias toward under-processed chunks), not in skipping nights.
|
||||
|
||||
Routing logic:
|
||||
- explicit_mode argument wins
|
||||
- task supplied → 'lucid' (question-anchored)
|
||||
- days_since_dream ≥ STALENESS_TRIGGER_DAYS → 'late-rem' (shake loose
|
||||
via cross-domain pairs when nothing's been added in a while)
|
||||
- new journal entry → 'early-rem' (emotional/personal register)
|
||||
- default → 'nrem' (replay-and-consolidation; always has something to
|
||||
do because the corpus always has under-processed chunks)
|
||||
"""
|
||||
if explicit_mode:
|
||||
return explicit_mode
|
||||
if task:
|
||||
return "lucid"
|
||||
|
||||
days_since = signal["days_since_dream"]
|
||||
new_journal = signal["new_journal_entries"]
|
||||
|
||||
if days_since >= STALENESS_TRIGGER_DAYS:
|
||||
return "late-rem"
|
||||
|
||||
if new_journal:
|
||||
return "early-rem"
|
||||
|
||||
return "nrem"
|
||||
|
||||
|
||||
# ─── CLI for manual inspection ──────────────────────────────────────────────
|
||||
|
||||
if __name__ == "__main__":
|
||||
signal = observe_corpus()
|
||||
short = {k: v for k, v in signal.items() if k != "recent_questions"}
|
||||
print("Signal (excluding recent_questions):")
|
||||
print(json.dumps(short, indent=2, default=str))
|
||||
print(f"\nRecent user questions ({len(signal['recent_questions'])}):")
|
||||
for q in signal["recent_questions"][:5]:
|
||||
print(f" - {q[:140]}")
|
||||
mode = select_mode(signal)
|
||||
print(f"\nselect_mode() → {mode!r}")
|
||||
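The routing order in select_mode can be checked by hand with a synthetic signal. The block below is a minimal, self-contained restatement of that precedence rather than the module itself: `route` and `make_signal` are hypothetical names introduced for illustration, and the threshold simply mirrors STALENESS_TRIGGER_DAYS from the file above.

```python
# Illustrative sketch only: restates the select_mode routing order so it can
# be exercised without the database, journal, or watcher state.
STALENESS_TRIGGER_DAYS = 3  # same value as the constant in the module above


def make_signal(days_since_dream, new_journal_entries=()):
    """Hypothetical helper: build the two signal keys the router reads."""
    return {
        "days_since_dream": days_since_dream,
        "new_journal_entries": list(new_journal_entries),
    }


def route(signal, task=None, explicit_mode=None):
    """Precedence: explicit mode > task > staleness > new journal > nrem."""
    if explicit_mode:
        return explicit_mode
    if task:
        return "lucid"
    if signal["days_since_dream"] >= STALENESS_TRIGGER_DAYS:
        return "late-rem"
    if signal["new_journal_entries"]:
        return "early-rem"
    return "nrem"


if __name__ == "__main__":
    print(route(make_signal(0.5)))                               # nrem
    print(route(make_signal(0.5, ["2026-05-01.md"])))            # early-rem
    print(route(make_signal(float("inf"), ["2026-05-01.md"])))   # late-rem
    print(route(make_signal(0.5), task="what changed?"))         # lucid
```

Two consequences of this ordering are worth noticing: staleness outranks a fresh journal entry, and a first run (days_since_dream is inf because no dream has been recorded) routes to 'late-rem' rather than 'nrem'.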
@@ -0,0 +1,331 @@
|
||||
"""
|
||||
Aaron AI Stage 1 encoding helpers — single canonical implementation of:
|
||||
- extract_blocks(filepath) — block extraction (docx single-block, pptx
|
||||
per-slide, pdf/txt/md single-block)
|
||||
- extract_text(filepath) — back-compat string concatenation over blocks
|
||||
- chunk_text(text, chunk_size, overlap) — word-based blind chunking
|
||||
- chunk_and_embed(text_or_blocks, source, embedder, filepath, folder) —
|
||||
produce ready-to-write rows. Accepts str (blind) or list[dict] (section-aware).
|
||||
- write_embeddings_batch(conn, batch) — server-side NOW() canonical INSERT
|
||||
|
||||
Used by watcher.py, ingest.py, corpus_integrity.py, and api.py /api/corpus/retry.
|
||||
"""
|
||||
|
||||
import hashlib
|
||||
import json
|
||||
import logging
|
||||
import re
|
||||
from pathlib import Path
|
||||
|
||||
from docx import Document as DocxDocument
|
||||
from pypdf import PdfReader
|
||||
from pptx import Presentation
|
||||
|
||||
log = logging.getLogger("encoding")
|
||||
|
||||
SUPPORTED = {".docx", ".pdf", ".pptx", ".txt", ".md"}
|
||||
DEFAULT_CHUNK_SIZE = 500
|
||||
DEFAULT_CHUNK_OVERLAP = 50
|
||||
|
||||
_BOLD_KV_RE = re.compile(r"^\*\*[\w +/-]+?:\*\*")
|
||||
|
||||
|
||||
def _strip_md_frontmatter(text: str) -> str:
|
||||
"""Strip a leading frontmatter block from markdown, if present.
|
||||
|
||||
Recognizes two formats:
|
||||
- YAML-style: file's first non-empty line is `---`, terminated by `---`.
|
||||
Only triggered when no heading precedes — guards against `---`
|
||||
horizontal rules that follow an H1.
|
||||
- Capture-style: optional H1 heading, then one or more `**key:** value`
|
||||
lines (and blanks), terminated by `---`. The H1 is preserved; the
|
||||
key/value block + separator are removed.
|
||||
|
||||
Body `---` rules and body `**bold:**` lines are never touched — the scan
|
||||
aborts as soon as a non-frontmatter line appears in the leading block.
|
||||
"""
|
||||
lines = text.splitlines()
|
||||
n = len(lines)
|
||||
i = 0
|
||||
while i < n and not lines[i].strip():
|
||||
i += 1
|
||||
heading = None
|
||||
if i < n and lines[i].startswith("# "):
|
||||
heading = lines[i]
|
||||
i += 1
|
||||
while i < n and not lines[i].strip():
|
||||
i += 1
|
||||
if i >= n:
|
||||
return text
|
||||
first = lines[i].strip()
|
||||
if heading is None and first == "---":
|
||||
j = i + 1
|
||||
while j < n and lines[j].strip() != "---":
|
||||
j += 1
|
||||
if j >= n:
|
||||
return text
|
||||
body_start = j + 1
|
||||
elif _BOLD_KV_RE.match(first):
|
||||
j = i
|
||||
while j < n:
|
||||
s = lines[j].strip()
|
||||
if not s or _BOLD_KV_RE.match(s):
|
||||
j += 1
|
||||
continue
|
||||
if s == "---":
|
||||
body_start = j + 1
|
||||
break
|
||||
return text
|
||||
else:
|
||||
return text
|
||||
else:
|
||||
return text
|
||||
body = "\n".join(lines[body_start:]).lstrip("\n")
|
||||
return f"{heading}\n\n{body}" if heading else body
|
||||
|
||||
|
||||
def _docx_cell_paragraphs(cell):
|
||||
yield from (p for p in cell.paragraphs if p.text.strip())
|
||||
for nested in cell.tables:
|
||||
for row in nested.rows:
|
||||
for c in row.cells:
|
||||
yield from _docx_cell_paragraphs(c)
|
||||
|
||||
|
||||
def _pptx_shape_text(shape):
|
||||
from pptx.enum.shapes import MSO_SHAPE_TYPE
|
||||
parts = []
|
||||
if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
|
||||
for sub in shape.shapes:
|
||||
parts.extend(_pptx_shape_text(sub))
|
||||
return parts
|
||||
if hasattr(shape, "text") and shape.text.strip():
|
||||
parts.append(shape.text)
|
||||
if getattr(shape, "has_table", False):
|
||||
for cell in shape.table.iter_cells():
|
||||
if cell.text.strip():
|
||||
parts.append(cell.text)
|
||||
return parts
|
||||
|
||||
|
||||
def _extract_docx_blocks(filepath: Path) -> list[dict]:
|
||||
"""Return docx content as a single block. Earlier attempt at section-aware
|
||||
chunking via Heading styles was rolled back: the user's docs are mostly
|
||||
Normal-styled with bold-as-heading, and tying chunk boundaries to formatting
|
||||
choices locks future-them into preserving those choices forever. Lexical
|
||||
+ cross-encoder retrieval already finds the right substrings within a
|
||||
blind-chunked CV, so the section structure isn't load-bearing for retrieval."""
|
||||
from docx.oxml.ns import qn
|
||||
|
||||
doc = DocxDocument(filepath)
|
||||
parts = [p.text for p in doc.paragraphs if p.text.strip()]
|
||||
for tbl in doc.tables:
|
||||
for row in tbl.rows:
|
||||
for cell in row.cells:
|
||||
parts.extend(p.text for p in _docx_cell_paragraphs(cell))
|
||||
for section in doc.sections:
|
||||
parts.extend(p.text for p in section.header.paragraphs if p.text.strip())
|
||||
parts.extend(p.text for p in section.footer.paragraphs if p.text.strip())
|
||||
for txbx in doc.element.body.findall(".//" + qn("w:txbxContent")):
|
||||
for p in txbx.findall(".//" + qn("w:p")):
|
||||
text = "".join(t.text or "" for t in p.findall(".//" + qn("w:t")))
|
||||
if text.strip():
|
||||
parts.append(text)
|
||||
text = "\n".join(parts)
|
||||
return [{"heading": None, "text": text, "kind": "doc"}] if text.strip() else []
|
||||
|
||||
|
||||
def _extract_pptx_blocks(filepath: Path) -> list[dict]:
|
||||
"""One block per slide. Heading = slide title (or 'Slide N' fallback).
|
||||
Body = non-title shape text + speaker notes."""
|
||||
prs = Presentation(filepath)
|
||||
blocks = []
|
||||
for i, slide in enumerate(prs.slides, 1):
|
||||
title_shape = None
|
||||
try:
|
||||
title_shape = slide.shapes.title
|
||||
except (AttributeError, KeyError):
|
||||
pass
|
||||
title = None
|
||||
body_parts = []
|
||||
for shape in slide.shapes:
|
||||
if title_shape is not None and shape == title_shape and shape.has_text_frame:
|
||||
title = shape.text_frame.text.strip() or None
|
||||
continue
|
||||
body_parts.extend(_pptx_shape_text(shape))
|
||||
if slide.has_notes_slide:
|
||||
notes = slide.notes_slide.notes_text_frame.text
|
||||
if notes.strip():
|
||||
body_parts.append(f"[Notes] {notes}")
|
||||
if title or body_parts:
|
||||
blocks.append({
|
||||
"heading": title or f"Slide {i}",
|
||||
"text": "\n".join(body_parts),
|
||||
"kind": "slide",
|
||||
})
|
||||
return blocks
|
||||
|
||||
|
||||
def extract_blocks(filepath: Path) -> list[dict]:
|
||||
"""Structured extraction. Returns list of {heading, text, kind} blocks.
|
||||
|
||||
- docx: single block, no heading (kind='doc'); the Heading-style section split was rolled back.
|
||||
- pptx: one block per slide (kind='slide').
|
||||
- pdf/txt/md: single block, no heading (kind='doc').
|
||||
|
||||
Empty list on any failure or unsupported extension."""
|
||||
suffix = filepath.suffix.lower()
|
||||
try:
|
||||
if suffix == ".docx":
|
||||
return _extract_docx_blocks(filepath)
|
||||
if suffix == ".pptx":
|
||||
return _extract_pptx_blocks(filepath)
|
||||
if suffix == ".pdf":
|
||||
reader = PdfReader(filepath)
|
||||
text = "".join(
|
||||
page.extract_text() + "\n"
|
||||
for page in reader.pages if page.extract_text()
|
||||
)
|
||||
return [{"heading": None, "text": text, "kind": "doc"}] if text.strip() else []
|
||||
if suffix in {".txt", ".md"}:
|
||||
text = filepath.read_text(encoding="utf-8", errors="ignore")
|
||||
if suffix == ".md":
|
||||
text = _strip_md_frontmatter(text)
|
||||
return [{"heading": None, "text": text, "kind": "doc"}] if text.strip() else []
|
||||
except Exception as e:
|
||||
log.warning(f"Extraction failed for {filepath.name}: {e}")
|
||||
return []
|
||||
|
||||
|
||||
def extract_text(filepath: Path) -> str:
|
||||
"""Back-compat wrapper: concatenate extract_blocks() output. Section
|
||||
structure is lost; use extract_blocks() directly for chunking."""
|
||||
blocks = extract_blocks(filepath)
|
||||
parts = []
|
||||
for b in blocks:
|
||||
if b.get("heading"):
|
||||
parts.append(b["heading"])
|
||||
if b.get("text"):
|
||||
parts.append(b["text"])
|
||||
return "\n".join(parts)
|
||||
|
||||
|
||||
def chunk_text(text: str,
|
||||
chunk_size: int = DEFAULT_CHUNK_SIZE,
|
||||
overlap: int = DEFAULT_CHUNK_OVERLAP) -> list[str]:
|
||||
"""Word-based chunking. Empty chunks filtered."""
|
||||
words = text.split()
|
||||
chunks = []
|
||||
start = 0
|
||||
while start < len(words):
|
||||
chunk = " ".join(words[start:start + chunk_size])
|
||||
if chunk.strip():
|
||||
chunks.append(chunk)
|
||||
start += max(1, chunk_size - overlap)  # guard: overlap >= chunk_size would otherwise never advance
|
||||
return chunks
|
||||
|
||||
|
||||
def _chunk_id(filepath, source: str, index: int) -> str:
|
||||
basis = str(filepath) if filepath else source
|
||||
return f"{hashlib.md5(basis.encode()).hexdigest()[:8]}_{index}"
|
||||
|
||||
|
||||
def chunk_and_embed(text_or_blocks,
|
||||
source: str,
|
||||
embedder,
|
||||
filepath=None,
|
||||
folder=None) -> list[dict]:
|
||||
"""Chunk + embed for write_embeddings_batch. Accepts either:
|
||||
|
||||
- str: blind chunking with 500-word windows (pdf/txt/md legacy path).
|
||||
- list[dict]: block-aware path (pptx slides; docx, pdf, txt, and md arrive
|
||||
as a single block). Each block emits one chunk if its text fits within
|
||||
DEFAULT_CHUNK_SIZE words, otherwise is blind-split with overlap.
|
||||
|
||||
The block heading is prepended to the chunk text (so retrieval sees the
|
||||
section context) and stored in metadata as heading/kind."""
|
||||
if isinstance(text_or_blocks, str):
|
||||
blocks = [{"heading": None, "text": text_or_blocks, "kind": "doc"}]
|
||||
else:
|
||||
blocks = text_or_blocks
|
||||
|
||||
chunks = []
|
||||
for block in blocks:
|
||||
body = block.get("text") or ""
|
||||
heading = block.get("heading")
|
||||
kind = block.get("kind", "doc")
|
||||
if not body.strip() and not (heading and heading.strip()):
|
||||
continue
|
||||
if heading and body.strip():
|
||||
contextualized = f"{heading}\n\n{body}"
|
||||
elif heading:
|
||||
contextualized = heading
|
||||
else:
|
||||
contextualized = body
|
||||
if len(contextualized.split()) <= DEFAULT_CHUNK_SIZE:
|
||||
chunks.append((contextualized, heading, kind))
|
||||
else:
|
||||
for sub in chunk_text(contextualized):
|
||||
chunks.append((sub, heading, kind))
|
||||
|
||||
if not chunks:
|
||||
return []
|
||||
embeddings = embedder.encode([c[0] for c in chunks]).tolist()
|
||||
rows = []
|
||||
for i, ((chunk, heading, kind), emb) in enumerate(zip(chunks, embeddings)):
|
||||
rows.append({
|
||||
"id": _chunk_id(filepath, source, i),
|
||||
"document": chunk,
|
||||
"embedding": emb,
|
||||
"source": source,
|
||||
"type": "document",
|
||||
"metadata": {
|
||||
"source": source,
|
||||
"filepath": str(filepath) if filepath else source,
|
||||
"folder": folder,
|
||||
"heading": heading,
|
||||
"kind": kind,
|
||||
},
|
||||
})
|
||||
return rows
|
||||
|
||||
|
||||
def write_embeddings_batch(conn, batch: list[dict], commit: bool = True) -> int:
|
||||
"""Single canonical INSERT. Sets created_at = NOW() server-side.
|
||||
|
||||
Every row dict must supply 'type'. created_at is SQL-supplied (NOW()), so
|
||||
callers do not need to provide it. The application-layer assertion is the
|
||||
primary enforcement point for type — the column lacks NOT NULL because
|
||||
historical NULLs were resolved by the Improvement #2 backfill, and a
|
||||
Python-level raise gives a faster, more debuggable failure than a
|
||||
Postgres constraint error.
|
||||
|
||||
When commit=True (default), this function commits the connection itself.
|
||||
When commit=False, the caller is responsible for committing. Use
|
||||
commit=False when composing this write with other writes that must land
|
||||
atomically in the same transaction.
|
||||
"""
|
||||
if not batch:
|
||||
return 0
|
||||
cur = conn.cursor()
|
||||
for row in batch:
|
||||
if not row.get("type"):
|
||||
raise ValueError(
|
||||
f"row {row.get('id')!r} missing 'type'; writers must supply it "
|
||||
f"(see Improvement #2 in docs/birdai-component-inventory)"
|
||||
)
|
||||
cur.execute("""
|
||||
INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
|
||||
VALUES (%s, %s, %s::vector, %s, %s, NOW(), %s)
|
||||
ON CONFLICT (id) DO UPDATE SET
|
||||
document = EXCLUDED.document,
|
||||
embedding = EXCLUDED.embedding,
|
||||
source = EXCLUDED.source,
|
||||
type = EXCLUDED.type,
|
||||
created_at = COALESCE(embeddings.created_at, EXCLUDED.created_at),
|
||||
metadata = EXCLUDED.metadata
|
||||
""", (row["id"], row["document"], row["embedding"],
|
||||
row["source"], row["type"], json.dumps(row["metadata"])))
|
||||
if commit:
|
||||
conn.commit()
|
||||
return len(batch)
|
||||
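The word-window arithmetic behind chunk_text is easier to see with concrete numbers. The sketch below restates the windowing under the default 500/50 settings so it runs without the docx, pypdf, or pptx imports; `chunk_words` is a hypothetical name used only here, not a function in the module.

```python
# Illustrative sketch: word-window chunking with overlap, restated from
# chunk_text so it runs with the standard library only.
def chunk_words(text, chunk_size=500, overlap=50):
    """Split text into windows of chunk_size words; consecutive windows
    share `overlap` words. The max(1, ...) stride keeps the loop advancing
    even if overlap >= chunk_size."""
    words = text.split()
    stride = max(1, chunk_size - overlap)
    chunks = []
    start = 0
    while start < len(words):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        start += stride
    return chunks


if __name__ == "__main__":
    text = " ".join(f"w{i}" for i in range(1000))
    chunks = chunk_words(text)
    # Windows start at words 0, 450 and 900, so 1000 words give 3 chunks.
    print(len(chunks), [c.split()[0] for c in chunks])
```

With these defaults each window after the first repeats the last 50 words of its predecessor, and the final window is short (100 words in the example) rather than padded.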
@@ -0,0 +1,193 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Audit Expansion Pack Generator — type-aware stratified draw of 12
|
||||
documents from base_class_validation_results.json for n=20 audit expansion.
|
||||
|
||||
Per audit-expansion-protocol.md amendment 2026-04-28:
|
||||
The seed=43 length-only random draw concentrated on course modules in the
|
||||
small and medium buckets, missing voice captures, syllabi, and
|
||||
conversational documents present in the candidate distribution.
|
||||
This script implements type-aware stratification within each length
|
||||
bucket to produce a sample representative of BirdAI's document-type mix.
|
||||
|
||||
Targets (12 total):
|
||||
small (4): 2 course_module + 2 voice_capture
|
||||
medium (4): 2 course_module + 1 syllabus + 1 other
|
||||
large (4): 1 course_ppt + 1 syllabus + 1 faculty_report + 1 conversational
|
||||
|
||||
Output: ~/aaronai/experiments/audit_expansion_pack.json
|
||||
|
||||
Usage:
|
||||
python3 ~/aaronai/scripts/audit_expansion_draw.py
|
||||
python3 ~/aaronai/scripts/audit_expansion_draw.py --dry-run
|
||||
"""
|
||||
import argparse
|
||||
import json
|
||||
import random
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
EXPERIMENTS = Path.home() / "aaronai" / "experiments"
|
||||
VALIDATION_RESULTS = EXPERIMENTS / "base_class_validation_results.json"
|
||||
EXISTING_AUDIT_PACK = EXPERIMENTS / "base_class_audit_pack.json"
|
||||
OUTPUT_FILE = EXPERIMENTS / "audit_expansion_pack.json"
|
||||
|
||||
SEED = 43
|
||||
|
||||
# Type-aware targets per bucket
|
||||
TYPE_TARGETS = {
|
||||
"small": {"course_module": 2, "voice_capture": 2},
|
||||
"medium": {"course_module": 2, "syllabus": 1, "other": 1},
|
||||
"large": {"course_ppt": 1, "syllabus": 1, "faculty_report": 1, "conversational": 1},
|
||||
}
|
||||
|
||||
|
||||
def classify(source, bucket):
|
||||
"""Map a source filename to a document type, scoped to bucket where
|
||||
type categories overlap (e.g., 'course_module' vs 'course_ppt')."""
|
||||
s = source.lower()
|
||||
|
||||
# Voice captures — pattern: YYYY-MM-DD-HH-MM-voice.md
|
||||
if re.match(r"\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-voice\.md$", source):
|
||||
return "voice_capture"
|
||||
|
||||
# Conversational exports — pattern: "Claude: ..." or "ChatGPT: ..."
|
||||
if source.startswith("Claude:") or source.startswith("ChatGPT:"):
|
||||
return "conversational"
|
||||
|
||||
# Syllabus — must contain "syllabus" in the name
|
||||
if "syllabus" in s:
|
||||
return "syllabus"
|
||||
|
||||
# Faculty / annual reports
|
||||
if "faculty report" in s or "annual report" in s:
|
||||
return "faculty_report"
|
||||
|
||||
# Course PPTs (large bucket only) — pattern: ".pptx" anywhere in the name, "_PPT_", or a "ModN_" prefix
|
||||
if bucket == "large" and (".pptx" in s or "_ppt_" in s or re.match(r"mod\d+_", s)):
|
||||
return "course_ppt"
|
||||
|
||||
# Course modules — pattern: two-digit numeric prefix such as "0N_*.docx" (not bucket-scoped; large-bucket files that miss the PPT test also land here)
|
||||
if re.match(r"^\d{2}_", source):
|
||||
return "course_module"
|
||||
|
||||
# Everything else falls into 'other' for medium; not used in small/large targets
|
||||
return "other"
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
args = parser.parse_args()
|
||||
|
||||
if not VALIDATION_RESULTS.exists():
|
||||
print(f"ERROR: {VALIDATION_RESULTS} not found", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
with open(VALIDATION_RESULTS) as f:
|
||||
validation = json.load(f)
|
||||
|
||||
all_docs = validation["results"]
|
||||
print(f"Loaded {len(all_docs)} documents from validation results")
|
||||
print(f"Experiment: {validation.get('title', 'unknown')}")
|
||||
|
||||
# Load existing audit pack to exclude its sources (audit pack uses 'pairs')
|
||||
excluded_sources = set()
|
||||
if EXISTING_AUDIT_PACK.exists():
|
||||
with open(EXISTING_AUDIT_PACK) as f:
|
||||
existing = json.load(f)
|
||||
existing_pairs = existing.get("pairs", existing.get("results", existing))
|
||||
for doc in existing_pairs:
|
||||
src = doc.get("source")
|
||||
if src:
|
||||
excluded_sources.add(src)
|
||||
print(f"Excluding {len(excluded_sources)} sources already in audit pack")
|
||||
|
||||
# Filter to valid candidates
|
||||
valid_docs = []
|
||||
for doc in all_docs:
|
||||
src = doc.get("source")
|
||||
if src in excluded_sources:
|
||||
continue
|
||||
if not doc.get("condition_a") or not doc.get("condition_b"):
|
||||
continue
|
||||
bucket = doc.get("size_bucket")
|
||||
if bucket not in TYPE_TARGETS:
|
||||
continue
|
||||
doc["_type"] = classify(src, bucket)
|
||||
valid_docs.append(doc)
|
||||
|
||||
print(f"Valid candidate documents: {len(valid_docs)}")
|
||||
|
||||
# Print what's available per (bucket, type) before drawing
|
||||
print(f"\nCandidates by (bucket, type):")
|
||||
for bucket in TYPE_TARGETS:
|
||||
bucket_docs = [d for d in valid_docs if d["size_bucket"] == bucket]
|
||||
types_in_bucket = {}
|
||||
for d in bucket_docs:
|
||||
types_in_bucket.setdefault(d["_type"], []).append(d)
|
||||
print(f" {bucket}:")
|
||||
for t in sorted(types_in_bucket.keys()):
|
||||
target = TYPE_TARGETS[bucket].get(t, "—")
|
||||
print(f" {t:>16}: {len(types_in_bucket[t])} avail, target {target}")
|
||||
|
||||
# Stratified type-aware draw
|
||||
random.seed(SEED)
|
||||
drawn = []
|
||||
warnings = []
|
||||
for bucket, type_targets in TYPE_TARGETS.items():
|
||||
bucket_docs = [d for d in valid_docs if d["size_bucket"] == bucket]
|
||||
for doc_type, target in type_targets.items():
|
||||
type_docs = [d for d in bucket_docs if d["_type"] == doc_type]
|
||||
if len(type_docs) < target:
|
||||
msg = (f"WARNING: bucket={bucket} type={doc_type} "
|
||||
f"available={len(type_docs)} target={target}")
|
||||
warnings.append(msg)
|
||||
print(msg, file=sys.stderr)
|
||||
n_to_draw = min(target, len(type_docs))
|
||||
sample = random.sample(type_docs, n_to_draw)
|
||||
drawn.extend(sample)
|
||||
|
||||
# Report draw
|
||||
print(f"\nDrew {len(drawn)} documents:")
|
||||
for d in drawn:
|
||||
src = d.get("source", "<unknown>")
|
||||
chars = d.get("doc_chars_original", 0)
|
||||
bucket = d.get("size_bucket", "?")
|
||||
doc_type = d.get("_type", "?")
|
||||
truncated = " (TRUNCATED)" if d.get("truncated") else ""
|
||||
print(f" [{bucket:>6}/{doc_type:>16}] {chars:>6}c {src}{truncated}")
|
||||
|
||||
# Bucket-level summary
|
||||
bucket_counts = {"small": 0, "medium": 0, "large": 0}
|
||||
for d in drawn:
|
||||
bucket_counts[d["size_bucket"]] += 1
|
||||
print(f"\nBucket totals: {bucket_counts}")
|
||||
|
||||
if args.dry_run:
|
||||
print(f"\n--dry-run set, not writing output file")
|
||||
return
|
||||
|
||||
output = {
|
||||
"metadata": {
|
||||
"generated_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
|
||||
"source_validation_file": str(VALIDATION_RESULTS),
|
||||
"seed": SEED,
|
||||
"stratification": "type-aware within length bucket",
|
||||
"type_targets": TYPE_TARGETS,
|
||||
"bucket_counts": bucket_counts,
|
||||
"excluded_count": len(excluded_sources),
|
||||
"warnings": warnings,
|
||||
"purpose": "n=20 audit expansion per audit-expansion-protocol.md (type-aware amendment)",
|
||||
},
|
||||
"results": drawn,
|
||||
}
|
||||
with open(OUTPUT_FILE, "w") as f:
|
||||
json.dump(output, f, indent=2, default=str)
|
||||
print(f"\nWrote {OUTPUT_FILE}")
|
||||
print(f" {len(drawn)} documents ready for rating")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
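The type-aware draw above reduces to "sample up to the target from each (bucket, type) pool and record any shortfall". A minimal sketch of that idea follows, under stated assumptions: the candidate pool and the reduced TARGETS table are invented for the example, `stratified_draw` is a hypothetical name, and it uses a local `random.Random` instance where the script seeds the global generator.

```python
import random

# Illustrative sketch: seeded, type-aware draw within length buckets.
# The targets and candidates here are made up for the example.
TARGETS = {
    "small": {"course_module": 2, "voice_capture": 2},
    "large": {"syllabus": 1},
}


def stratified_draw(candidates, targets, seed=43):
    """Draw up to `target` docs per (bucket, type); report shortfalls."""
    rng = random.Random(seed)  # local RNG: leaves global random state alone
    drawn, shortfalls = [], []
    for bucket, type_targets in targets.items():
        for doc_type, target in type_targets.items():
            pool = [d for d in candidates
                    if d["size_bucket"] == bucket and d["type"] == doc_type]
            if len(pool) < target:
                shortfalls.append((bucket, doc_type, len(pool), target))
            drawn.extend(rng.sample(pool, min(target, len(pool))))
    return drawn, shortfalls


if __name__ == "__main__":
    candidates = (
        [{"source": f"0{i}_module.docx", "size_bucket": "small",
          "type": "course_module"} for i in range(5)]
        + [{"source": "2026-04-01-09-30-voice.md", "size_bucket": "small",
            "type": "voice_capture"}]
        + [{"source": "Course Syllabus.docx", "size_bucket": "large",
            "type": "syllabus"}]
    )
    drawn, shortfalls = stratified_draw(candidates, TARGETS)
    print(len(drawn))    # 4: two modules, one voice capture, one syllabus
    print(shortfalls)    # [('small', 'voice_capture', 1, 2)]
```

Returning the shortfalls alongside the sample, rather than only printing them, makes an under-filled stratum visible to whatever consumes the pack; the script above does the equivalent by storing its warnings in the output metadata.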
@@ -0,0 +1,605 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Base-Class Enrichment Test — OOP Framing Experiment
|
||||
|
||||
Tests whether non-entity metadata from a local model (domain class, structural
|
||||
signals, presence flags, length, summary) can take load off the API without
|
||||
constraining what it extracts.
|
||||
|
||||
The local model does NOT draft entities. The API still does full extraction.
|
||||
The local model produces metadata that orients the API's reading.
|
||||
|
||||
Conditions:
|
||||
A — Baseline: single Claude Haiku call, full extraction, no metadata
|
||||
B — Base-class: Mistral metadata + Haiku full extraction with metadata as frame
|
||||
|
||||
Critical test: B's edge count and predicate diversity must be ≥A's, or close.
|
||||
If B produces fewer edges or less predicate diversity, metadata is acting as
|
||||
constraint and the OOP framing is falsified.
|
||||
|
||||
Sample: 50 docs from briefing_test_v2_results.json:
|
||||
- 15 small (<1000 chars)
|
||||
- 25 medium (1000-5000 chars)
|
||||
- 10 large (5000-12000 chars, capped at 12K)
|
||||
|
||||
Outputs: ~/aaronai/experiments/base_class_audit_rerun_results.json
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import statistics
|
||||
import sys
|
||||
import time
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
import anthropic
|
||||
import psycopg2
|
||||
import requests
|
||||
from dotenv import load_dotenv
|
||||
|
||||
load_dotenv(Path.home() / "aaronai" / ".env")
|
||||
|
||||
V2_FILE = Path.home() / "aaronai" / "briefing_test_v2_results.json"
|
||||
OUTPUT_FILE = Path.home() / "aaronai" / "experiments" / "base_class_audit_rerun_results.json"
|
||||
HAIKU_MODEL = "claude-haiku-4-5-20251001"
|
||||
HAIKU_MAX_TOKENS = 8192
|
||||
HAIKU_TEMPERATURE = 0.0
|
||||
OLLAMA_URL = "http://localhost:11434/api/generate"
|
||||
LOCAL_MODEL = "mistral"
|
||||
LOCAL_TIMEOUT = 180
|
||||
MAX_DOC_CHARS = 12000
|
||||
|
||||
HAIKU_IN_PER_M = 1.0
|
||||
HAIKU_OUT_PER_M = 5.0
|
||||
|
||||
|
||||
CONDITION_A_PROMPT = """Extract a knowledge graph from the document below.
|
||||
|
||||
Return ONLY valid JSON with this exact schema:
|
||||
{
|
||||
"entities": [
|
||||
{"name": string, "type": string}
|
||||
],
|
||||
"edges": [
|
||||
{"subject": string, "predicate": string, "object": string}
|
||||
]
|
||||
}
|
||||
|
||||
Entity types: use whatever fits the entity. Do not constrain yourself to a fixed list.
|
||||
|
||||
Edge predicates: natural language phrases that capture the actual relationship the document states or implies.
|
||||
|
||||
Extract every entity and every relationship the document states or strongly implies. Both subject and object in every edge must appear in entities. JSON only, no commentary, no markdown fences.
|
||||
|
||||
DOCUMENT:
|
||||
"""
|
||||
|
||||
LOCAL_METADATA_PROMPT = """Analyze the document below and produce metadata describing its surface features. Do NOT extract entities. Do NOT identify content. Only produce structural and surface-level metadata.
|
||||
|
||||
Return ONLY valid JSON with this exact schema:
|
||||
{
|
||||
"language": "en or other",
|
||||
"char_length": integer,
|
||||
"primary_format": "prose, presentation, list, form, code, or mixed",
|
||||
"structural_signals": {
|
||||
"has_headings": boolean,
|
||||
"has_bullet_lists": boolean,
|
||||
"has_numbered_lists": boolean,
|
||||
"has_tables": boolean,
|
||||
"has_code_blocks": boolean,
|
||||
"has_dates": boolean
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": boolean,
|
||||
"has_institutional_language": boolean,
|
||||
"has_technical_terminology": boolean,
|
||||
"has_first_person": boolean,
|
||||
"has_quotations": boolean
|
||||
},
|
||||
"domain_class": "technical, administrative, personal, educational, creative, reference, or mixed",
|
||||
"one_sentence_summary": "string of 25 words or fewer describing what the document is about"
|
||||
}
|
||||
|
||||
JSON only, no commentary.
|
||||
|
||||
DOCUMENT:
|
||||
"""
|
||||
|
||||
CONDITION_B_API_PROMPT = """You are extracting a knowledge graph from a document. The document has been pre-analyzed by a local model and the following metadata is provided as orienting context — not as constraint. Extract every entity and every relationship in the document. Do not limit your extraction to what the metadata suggests; the metadata is here to orient your reading, not to bound it.

DOCUMENT METADATA:
{metadata_json}

Return ONLY valid JSON with this exact schema:
{
  "entities": [
    {"name": string, "type": string}
  ],
  "edges": [
    {"subject": string, "predicate": string, "object": string}
  ]
}

Entity types: use whatever fits. Edge predicates: natural language phrases capturing the actual relationship. Both subject and object in every edge must appear in entities. Extract every entity and every relationship the document states or strongly implies. Do not filter for salience. JSON only, no commentary, no markdown fences.

DOCUMENT:
"""


def strip_json_fences(text):
    if not text:
        return ""
    t = text.strip()
    t = re.sub(r"^```(?:json)?\s*", "", t)
    t = re.sub(r"\s*```$", "", t)
    return t.strip()


def fetch_document_text(pg_conn, source):
    cur = pg_conn.cursor()
    cur.execute(
        "SELECT document FROM embeddings WHERE source = %s ORDER BY id",
        (source,),
    )
    rows = cur.fetchall()
    cur.close()
    if not rows:
        return None, 0
    full = "\n\n".join(r[0] for r in rows)
    return full[:MAX_DOC_CHARS], len(full)


def call_haiku(client, prompt_text):
    t0 = time.time()
    resp = client.messages.create(
        model=HAIKU_MODEL,
        max_tokens=HAIKU_MAX_TOKENS,
        temperature=HAIKU_TEMPERATURE,
        messages=[{"role": "user", "content": prompt_text}],
    )
    return {
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
        "latency_s": round(time.time() - t0, 2),
        "response_text": resp.content[0].text if resp.content else "",
        "stop_reason": resp.stop_reason,
    }


def call_local_metadata(document_text):
    t0 = time.time()
    try:
        resp = requests.post(
            OLLAMA_URL,
            json={
                "model": LOCAL_MODEL,
                "prompt": LOCAL_METADATA_PROMPT + document_text,
                "stream": False,
                "format": "json",
                "options": {"num_predict": 1024, "temperature": 0, "num_ctx": 12288},
            },
            timeout=LOCAL_TIMEOUT,
        )
        resp.raise_for_status()
        return {
            "response": resp.json().get("response", ""),
            "latency_s": round(time.time() - t0, 2),
        }
    except Exception as e:
        return {"error": str(e), "latency_s": round(time.time() - t0, 2)}


def parse_graph_full(raw):
    """Return (entities_list, edges_list, parsed_ok). Lists for metric computation."""
    cleaned = strip_json_fences(raw)
    if not cleaned:
        return None, None, False
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None, None, False
    if not isinstance(data, dict):
        return None, None, False
    ents = data.get("entities")
    edges = data.get("edges")
    if isinstance(ents, list) and isinstance(edges, list):
        return ents, edges, True
    return None, None, False


def parse_metadata(raw):
    cleaned = strip_json_fences(raw)
    if not cleaned:
        return None
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None


def graph_metrics(entities, edges):
    """Compute graph quality metrics. Inputs are lists from parse_graph_full."""
    if entities is None or edges is None:
        return None
    n_entities = len(entities)
    n_edges = len(edges)

    # Predicate diversity
    predicates = set()
    for e in edges:
        if isinstance(e, dict):
            p = e.get("predicate")
            if p:
                predicates.add(str(p).strip().lower())
    predicate_diversity = len(predicates)

    # Entity type diversity
    types = set()
    for ent in entities:
        if isinstance(ent, dict):
            t = ent.get("type")
            if t:
                types.add(str(t).strip().lower())
    type_diversity = len(types)

    # Average degree (edges*2 / entities — each edge touches two nodes)
    avg_degree = (2 * n_edges / n_entities) if n_entities > 0 else 0.0

    # Largest connected component
    # Build adjacency from edges
    entity_names = set()
    for ent in entities:
        if isinstance(ent, dict):
            n = ent.get("name")
            if n:
                entity_names.add(str(n).strip().lower())

    adj = {name: set() for name in entity_names}
    for e in edges:
        if not isinstance(e, dict):
            continue
        s = str(e.get("subject", "")).strip().lower()
        o = str(e.get("object", "")).strip().lower()
        if s in adj and o in adj:
            adj[s].add(o)
            adj[o].add(s)

    # Iterative DFS (explicit stack) for largest component
    visited = set()
    largest = 0
    for start in adj:
        if start in visited:
            continue
        component = 0
        stack = [start]
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            component += 1
            for neighbor in adj[node]:
                if neighbor not in visited:
                    stack.append(neighbor)
        if component > largest:
            largest = component

    return {
        "n_entities": n_entities,
        "n_edges": n_edges,
        "predicate_diversity": predicate_diversity,
        "type_diversity": type_diversity,
        "avg_degree": round(avg_degree, 2),
        "largest_component": largest,
        "largest_component_pct": round(100 * largest / n_entities, 1) if n_entities else 0.0,
    }


def stratify(docs):
    """Audit re-run: load the 10 audit docs from base_class_audit_pack.json."""
    import json as _json
    audit_file = Path.home() / "aaronai" / "experiments" / "base_class_audit_pack.json"
    if not audit_file.exists():
        print(f"ERROR: {audit_file} not found")
        return []
    audit = _json.loads(audit_file.read_text())
    audit_sources = [p["source"] for p in audit["pairs"]]

    # Synthesize doc_meta entries for the audit sources
    sample = [{"source": s, "content_length": 0, "status": "SUCCESS"}
              for s in audit_sources]
    print(f"Audit re-run: {len(sample)} docs from base_class_audit_pack.json")
    return sample


def fmt_metrics(m):
    if m is None:
        return "n/a"
    return (f"e={m['n_entities']} edge={m['n_edges']} "
            f"pred={m['predicate_diversity']} type={m['type_diversity']} "
            f"deg={m['avg_degree']} comp={m['largest_component']}/{m['n_entities']}")


def main():
    api_key = os.environ.get("ANTHROPIC_API_KEY")
    pg_dsn = os.environ.get("PG_DSN")
    if not api_key or not pg_dsn:
        print("ERROR: ANTHROPIC_API_KEY or PG_DSN not set", file=sys.stderr)
        sys.exit(1)

    if not V2_FILE.exists():
        print(f"ERROR: {V2_FILE} not found", file=sys.stderr)
        sys.exit(1)

    with open(V2_FILE) as f:
        v2 = json.load(f)

    docs_meta = [d for d in v2["documents"] if d.get("status") == "SUCCESS"]
    sample = stratify(docs_meta)
    print(f"Sample: {len(sample)} docs (15s/25m/10l, file order)")
    print(f"Mistral context: 12288 tokens, doc cap {MAX_DOC_CHARS} chars")
    print(f"Haiku model: {HAIKU_MODEL} temp={HAIKU_TEMPERATURE}")
    print(f"Test: base-class metadata as orienting frame, NOT entity drafting")
    print()

    client = anthropic.Anthropic(api_key=api_key)
    pg_conn = psycopg2.connect(pg_dsn)

    results = []
    started_at = datetime.now(timezone.utc).isoformat()
    t_total = time.time()

    for i, doc_meta in enumerate(sample, 1):
        source = doc_meta["source"]
        doc_text, original_len = fetch_document_text(pg_conn, source)
        if not doc_text:
            print(f"[{i:02d}/{len(sample)}] {source[:55]} — SKIP (not in pgvector)")
            results.append({"source": source, "skipped": "not_in_pgvector"})
            continue

        sent_len = len(doc_text)
        truncated = original_len > sent_len
        size_bucket = (
            "small" if sent_len < 1000
            else "medium" if sent_len < 5000
            else "large"
        )
        trunc_marker = "*" if truncated else " "
        print(f"[{i:02d}/{len(sample)}] [{size_bucket:6s}] [{sent_len:>5}c{trunc_marker}] {source[:55]}", flush=True)

        # Condition A
        try:
            a = call_haiku(client, CONDITION_A_PROMPT + doc_text)
            a_ents, a_edges, a_ok = parse_graph_full(a["response_text"])
            a_metrics = graph_metrics(a_ents, a_edges) if a_ok else None
            print(f" A: in={a['input_tokens']} out={a['output_tokens']} "
                  f"stop={a['stop_reason']} t={a['latency_s']}s", flush=True)
            print(f" {fmt_metrics(a_metrics)}", flush=True)
        except Exception as e:
            print(f" A FAILED: {e}", flush=True)
            a = {"error": str(e)}
            a_metrics = None

        # Condition B local metadata pass
        local_result = call_local_metadata(doc_text)
        if "error" in local_result:
            print(f" B local FAILED: {local_result['error']}", flush=True)
            results.append({
                "source": source,
                "size_bucket": size_bucket,
                "doc_chars_original": original_len,
                "doc_chars_sent": sent_len,
                "truncated": truncated,
                "condition_a": {
                    "input_tokens": a.get("input_tokens"),
                    "output_tokens": a.get("output_tokens"),
                    "latency_s": a.get("latency_s"),
                    "metrics": a_metrics,
                    "stop_reason": a.get("stop_reason"),
                    "response_text": a.get("response_text", "")[:32000],
                    "error": a.get("error"),
                },
                "condition_b": {
                    "skipped": "local_model_failed",
                    "local_error": local_result["error"],
                    "local_latency_s": local_result.get("latency_s"),
                },
            })
            continue

        local_raw = local_result["response"]
        metadata = parse_metadata(local_raw)
        # Override LLM-hallucinated char_length with Python-computed truth
        if metadata is not None and isinstance(metadata, dict):
            metadata["char_length"] = len(doc_text)
        print(f" B local: t={local_result['latency_s']}s metadata_parsed={metadata is not None}",
              flush=True)

        if metadata is None:
            print(f" B: metadata parse failed — skipping API call", flush=True)
            results.append({
                "source": source,
                "size_bucket": size_bucket,
                "doc_chars_original": original_len,
                "doc_chars_sent": sent_len,
                "truncated": truncated,
                "condition_a": {
                    "input_tokens": a.get("input_tokens"),
                    "output_tokens": a.get("output_tokens"),
                    "latency_s": a.get("latency_s"),
                    "metrics": a_metrics,
                    "stop_reason": a.get("stop_reason"),
                    "response_text": a.get("response_text", "")[:32000],
                    "error": a.get("error"),
                },
                "condition_b": {
                    "skipped": "metadata_parse_failed",
                    "local_latency_s": local_result.get("latency_s"),
                    "local_raw": local_raw[:1000],
                },
            })
            continue

        metadata_json = json.dumps(metadata, ensure_ascii=False, indent=2)
        b_prompt = CONDITION_B_API_PROMPT.replace("{metadata_json}", metadata_json) + doc_text

        try:
            b = call_haiku(client, b_prompt)
            b_ents, b_edges, b_ok = parse_graph_full(b["response_text"])
            b_metrics = graph_metrics(b_ents, b_edges) if b_ok else None
            print(f" B api: in={b['input_tokens']} out={b['output_tokens']} "
                  f"stop={b['stop_reason']} t={b['latency_s']}s", flush=True)
            print(f" {fmt_metrics(b_metrics)}", flush=True)
        except Exception as e:
            print(f" B api FAILED: {e}", flush=True)
            b = {"error": str(e)}
            b_metrics = None

        # Per-doc deltas
        if "input_tokens" in a and "input_tokens" in b:
            in_pct = (b["input_tokens"] - a["input_tokens"]) / a["input_tokens"] * 100 if a["input_tokens"] else 0.0
            out_pct = (b["output_tokens"] - a["output_tokens"]) / a["output_tokens"] * 100 if a["output_tokens"] else 0.0
            edge_pct_str = "n/a"
            pred_pct_str = "n/a"
            if a_metrics and b_metrics:
                if a_metrics["n_edges"] > 0:
                    edge_pct_str = f"{(b_metrics['n_edges'] - a_metrics['n_edges']) / a_metrics['n_edges'] * 100:+.1f}%"
                if a_metrics["predicate_diversity"] > 0:
                    pred_pct_str = f"{(b_metrics['predicate_diversity'] - a_metrics['predicate_diversity']) / a_metrics['predicate_diversity'] * 100:+.1f}%"
            print(f" Δ in={in_pct:+.1f}% out={out_pct:+.1f}% edges={edge_pct_str} pred={pred_pct_str}",
                  flush=True)

        results.append({
            "source": source,
            "size_bucket": size_bucket,
            "doc_chars_original": original_len,
            "doc_chars_sent": sent_len,
            "truncated": truncated,
            "condition_a": {
                "input_tokens": a.get("input_tokens"),
                "output_tokens": a.get("output_tokens"),
                "latency_s": a.get("latency_s"),
                "metrics": a_metrics,
                "stop_reason": a.get("stop_reason"),
                "response_text": a.get("response_text", "")[:32000],
                "error": a.get("error"),
            },
            "condition_b": {
                "local_latency_s": local_result.get("latency_s"),
                "local_metadata": metadata,
                "local_raw": local_raw[:1000],
                "api_input_tokens": b.get("input_tokens"),
                "api_output_tokens": b.get("output_tokens"),
                "api_latency_s": b.get("latency_s"),
                "metrics": b_metrics,
                "stop_reason": b.get("stop_reason"),
                "response_text": b.get("response_text", "")[:32000],
                "error": b.get("error"),
            },
        })

    pg_conn.close()
    total_elapsed = round(time.time() - t_total, 1)

    valid = [r for r in results
             if r.get("condition_a", {}).get("metrics") is not None
             and r.get("condition_b", {}).get("metrics") is not None]

    a_in = sum(r["condition_a"]["input_tokens"] for r in valid)
    a_out = sum(r["condition_a"]["output_tokens"] for r in valid)
    b_in = sum(r["condition_b"]["api_input_tokens"] for r in valid)
    b_out = sum(r["condition_b"]["api_output_tokens"] for r in valid)
    a_cost = (a_in * HAIKU_IN_PER_M + a_out * HAIKU_OUT_PER_M) / 1_000_000
    b_cost = (b_in * HAIKU_IN_PER_M + b_out * HAIKU_OUT_PER_M) / 1_000_000

    def avg_metric(rows, condition, key):
        vals = [r[condition]["metrics"][key] for r in rows if r[condition]["metrics"]]
        return round(statistics.mean(vals), 2) if vals else None

    by_bucket = {}
    for bucket in ("small", "medium", "large"):
        rows = [r for r in valid if r["size_bucket"] == bucket]
        if not rows:
            by_bucket[bucket] = None
            continue
        ai = sum(r["condition_a"]["input_tokens"] for r in rows)
        ao = sum(r["condition_a"]["output_tokens"] for r in rows)
        bi = sum(r["condition_b"]["api_input_tokens"] for r in rows)
        bo = sum(r["condition_b"]["api_output_tokens"] for r in rows)
        by_bucket[bucket] = {
            "n": len(rows),
            "input_delta_pct": round((bi - ai) / ai * 100, 2) if ai else None,
            "output_delta_pct": round((bo - ao) / ao * 100, 2) if ao else None,
            "a_avg_entities": avg_metric(rows, "condition_a", "n_entities"),
            "b_avg_entities": avg_metric(rows, "condition_b", "n_entities"),
            "a_avg_edges": avg_metric(rows, "condition_a", "n_edges"),
            "b_avg_edges": avg_metric(rows, "condition_b", "n_edges"),
            "a_avg_predicate_diversity": avg_metric(rows, "condition_a", "predicate_diversity"),
            "b_avg_predicate_diversity": avg_metric(rows, "condition_b", "predicate_diversity"),
            "a_avg_type_diversity": avg_metric(rows, "condition_a", "type_diversity"),
            "b_avg_type_diversity": avg_metric(rows, "condition_b", "type_diversity"),
            "a_avg_degree": avg_metric(rows, "condition_a", "avg_degree"),
            "b_avg_degree": avg_metric(rows, "condition_b", "avg_degree"),
            "a_avg_largest_component_pct": avg_metric(rows, "condition_a", "largest_component_pct"),
            "b_avg_largest_component_pct": avg_metric(rows, "condition_b", "largest_component_pct"),
        }

    summary = {
        "experiment": "base_class_test",
        "title": "Base-Class Enrichment — OOP Framing",
        "started_at": started_at,
        "completed_at": datetime.now(timezone.utc).isoformat(),
        "haiku_model": HAIKU_MODEL,
        "local_model": LOCAL_MODEL,
        "max_doc_chars": MAX_DOC_CHARS,
        "n_documents": len(sample),
        "n_valid_pairs": len(valid),
        "total_elapsed_s": total_elapsed,
        "totals": {
            "a_input_tokens": a_in,
            "a_output_tokens": a_out,
            "b_input_tokens": b_in,
            "b_output_tokens": b_out,
            "a_cost_usd": round(a_cost, 4),
            "b_cost_usd": round(b_cost, 4),
            "cost_delta_usd": round(b_cost - a_cost, 4),
            "cost_delta_pct": round((b_cost - a_cost) / a_cost * 100, 2) if a_cost else None,
            "note": "API cost only — local Mistral runtime on VPS not monetized",
        },
        "by_size_bucket": by_bucket,
        "results": results,
    }

    OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    with open(OUTPUT_FILE, "w") as f:
        json.dump(summary, f, indent=2)

    print()
    print("=" * 60)
    print(f"DONE — {len(valid)}/{len(sample)} valid pairs in {total_elapsed}s")
    print(f"A total cost: ${a_cost:.4f} (in={a_in} out={a_out})")
    print(f"B total cost: ${b_cost:.4f} (in={b_in} out={b_out})")
    delta_pct = summary['totals']['cost_delta_pct']
    if delta_pct is not None:
        verdict = "B cheaper" if delta_pct < 0 else "B more expensive"
        print(f"Cost delta: {delta_pct:+.2f}% ({verdict})")
    print()
    print("By bucket — graph metrics (A vs B):")
    for bucket, stats in by_bucket.items():
        if stats:
            print(f" {bucket:6s} (n={stats['n']}):")
            print(f" cost: in {stats['input_delta_pct']:+.1f}% out {stats['output_delta_pct']:+.1f}%")
            print(f" entities: A={stats['a_avg_entities']} B={stats['b_avg_entities']}")
            print(f" edges: A={stats['a_avg_edges']} B={stats['b_avg_edges']}")
            print(f" predicate diversity: A={stats['a_avg_predicate_diversity']} B={stats['b_avg_predicate_diversity']}")
            print(f" type diversity: A={stats['a_avg_type_diversity']} B={stats['b_avg_type_diversity']}")
            print(f" avg degree: A={stats['a_avg_degree']} B={stats['b_avg_degree']}")
            print(f" largest component %: A={stats['a_avg_largest_component_pct']} B={stats['b_avg_largest_component_pct']}")
    print()
    print(f"Results: {OUTPUT_FILE}")


if __name__ == "__main__":
    main()
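
The connectivity metric computed by `graph_metrics` above can be tried on a toy graph. The following is a minimal standalone sketch, not part of the committed script: the helper name `largest_component_size`, the tuple-based edge format, and the toy data are all assumptions made for illustration. It mirrors the same idea (undirected adjacency, explicit-stack traversal, edges to unknown names ignored).

```python
def largest_component_size(entity_names, edges):
    """Size of the largest connected component, treating edges as undirected.

    Hypothetical helper for illustration; edges are (subject, object) tuples.
    """
    adj = {name: set() for name in entity_names}
    for subject, obj in edges:
        # Edges that mention a name missing from the entity list are skipped.
        if subject in adj and obj in adj:
            adj[subject].add(obj)
            adj[obj].add(subject)

    visited = set()
    largest = 0
    for start in adj:
        if start in visited:
            continue
        size = 0
        stack = [start]  # explicit stack: iterative depth-first traversal
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            size += 1
            stack.extend(adj[node] - visited)
        largest = max(largest, size)
    return largest


# Toy data (invented): "ghost" is referenced by an edge but is not an entity.
toy_entities = ["alice", "acme", "bob", "report"]
toy_edges = [("alice", "acme"), ("bob", "acme"), ("report", "ghost")]

toy_largest = largest_component_size(toy_entities, toy_edges)
toy_avg_degree = 2 * len(toy_edges) / len(toy_entities)
```

On this toy input the dangling `report`/`ghost` edge still counts toward the average degree, because that figure is derived from the raw edge count, while the component search drops it. The same asymmetry applies to `avg_degree` and `largest_component` in the script, which is worth keeping in mind when comparing the two conditions.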
@@ -0,0 +1,593 @@
#!/usr/bin/env python3
"""
Base-Class Enrichment Test — OOP Framing Experiment

Tests whether non-entity metadata from a local model (domain class, structural
signals, presence flags, length, summary) can take load off the API without
constraining what it extracts.

The local model does NOT draft entities. The API still does full extraction.
The local model produces metadata that orients the API's reading.

Conditions:
  A — Baseline: single Claude Haiku call, full extraction, no metadata
  B — Base-class: Mistral metadata + Haiku full extraction with metadata as frame

Critical test: B's edge count and predicate diversity must be ≥A's, or close.
If B produces fewer edges or less predicate diversity, metadata is acting as
constraint and the OOP framing is falsified.

Sample: 20 docs from briefing_test_v2_results.json:
  - 5 small (<1000 chars)
  - 10 medium (1000-5000 chars)
  - 5 large (5000-12000 chars, capped at 12K)

Outputs: ~/aaronai/experiments/base_class_test_results.json
"""

import json
import os
import re
import statistics
import sys
import time
from datetime import datetime, timezone
from pathlib import Path

import anthropic
import psycopg2
import requests
from dotenv import load_dotenv

load_dotenv(Path.home() / "aaronai" / ".env")

V2_FILE = Path.home() / "aaronai" / "briefing_test_v2_results.json"
OUTPUT_FILE = Path.home() / "aaronai" / "experiments" / "base_class_test_results.json"
HAIKU_MODEL = "claude-haiku-4-5-20251001"
HAIKU_MAX_TOKENS = 4096
HAIKU_TEMPERATURE = 0.0
OLLAMA_URL = "http://localhost:11434/api/generate"
LOCAL_MODEL = "mistral"
LOCAL_TIMEOUT = 180
MAX_DOC_CHARS = 12000

HAIKU_IN_PER_M = 1.0
HAIKU_OUT_PER_M = 5.0


CONDITION_A_PROMPT = """Extract a knowledge graph from the document below.

Return ONLY valid JSON with this exact schema:
{
  "entities": [
    {"name": string, "type": string}
  ],
  "edges": [
    {"subject": string, "predicate": string, "object": string}
  ]
}

Entity types: use whatever fits the entity. Do not constrain yourself to a fixed list.

Edge predicates: natural language phrases that capture the actual relationship the document states or implies.

Extract every entity and every relationship the document states or strongly implies. Both subject and object in every edge must appear in entities. JSON only, no commentary, no markdown fences.

DOCUMENT:
"""

LOCAL_METADATA_PROMPT = """Analyze the document below and produce metadata describing its surface features. Do NOT extract entities. Do NOT identify content. Only produce structural and surface-level metadata.

Return ONLY valid JSON with this exact schema:
{
  "language": "en or other",
  "char_length": integer,
  "primary_format": "prose, presentation, list, form, code, or mixed",
  "structural_signals": {
    "has_headings": boolean,
    "has_bullet_lists": boolean,
    "has_numbered_lists": boolean,
    "has_tables": boolean,
    "has_code_blocks": boolean,
    "has_dates": boolean
  },
  "content_signals": {
    "has_named_people": boolean,
    "has_institutional_language": boolean,
    "has_technical_terminology": boolean,
    "has_first_person": boolean,
    "has_quotations": boolean
  },
  "domain_class": "technical, administrative, personal, educational, creative, reference, or mixed",
  "one_sentence_summary": "string of 25 words or fewer describing what the document is about"
}

JSON only, no commentary.

DOCUMENT:
"""

CONDITION_B_API_PROMPT = """You are extracting a knowledge graph from a document. The document has been pre-analyzed by a local model and the following metadata is provided as orienting context — not as constraint. Extract every entity and every relationship in the document. Do not limit your extraction to what the metadata suggests; the metadata is here to orient your reading, not to bound it.

DOCUMENT METADATA:
{metadata_json}

Return ONLY valid JSON with this exact schema:
{
  "entities": [
    {"name": string, "type": string}
  ],
  "edges": [
    {"subject": string, "predicate": string, "object": string}
  ]
}

Entity types: use whatever fits. Edge predicates: natural language phrases capturing the actual relationship. Both subject and object in every edge must appear in entities. Extract every entity and every relationship the document states or strongly implies. Do not filter for salience. JSON only, no commentary, no markdown fences.

DOCUMENT:
"""


def strip_json_fences(text):
    if not text:
        return ""
    t = text.strip()
    t = re.sub(r"^```(?:json)?\s*", "", t)
    t = re.sub(r"\s*```$", "", t)
    return t.strip()


def fetch_document_text(pg_conn, source):
    cur = pg_conn.cursor()
    cur.execute(
        "SELECT document FROM embeddings WHERE source = %s ORDER BY id",
        (source,),
    )
    rows = cur.fetchall()
    cur.close()
    if not rows:
        return None, 0
    full = "\n\n".join(r[0] for r in rows)
    return full[:MAX_DOC_CHARS], len(full)


def call_haiku(client, prompt_text):
    t0 = time.time()
    resp = client.messages.create(
        model=HAIKU_MODEL,
        max_tokens=HAIKU_MAX_TOKENS,
        temperature=HAIKU_TEMPERATURE,
        messages=[{"role": "user", "content": prompt_text}],
    )
    return {
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
        "latency_s": round(time.time() - t0, 2),
        "response_text": resp.content[0].text if resp.content else "",
        "stop_reason": resp.stop_reason,
    }


def call_local_metadata(document_text):
    t0 = time.time()
    try:
        resp = requests.post(
            OLLAMA_URL,
            json={
                "model": LOCAL_MODEL,
                "prompt": LOCAL_METADATA_PROMPT + document_text,
                "stream": False,
                "format": "json",
                "options": {"num_predict": 1024, "temperature": 0, "num_ctx": 12288},
            },
            timeout=LOCAL_TIMEOUT,
        )
        resp.raise_for_status()
        return {
            "response": resp.json().get("response", ""),
            "latency_s": round(time.time() - t0, 2),
        }
    except Exception as e:
        return {"error": str(e), "latency_s": round(time.time() - t0, 2)}


def parse_graph_full(raw):
    """Return (entities_list, edges_list, parsed_ok). Lists for metric computation."""
    cleaned = strip_json_fences(raw)
    if not cleaned:
        return None, None, False
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None, None, False
    if not isinstance(data, dict):
        return None, None, False
    ents = data.get("entities")
    edges = data.get("edges")
    if isinstance(ents, list) and isinstance(edges, list):
        return ents, edges, True
    return None, None, False


def parse_metadata(raw):
    cleaned = strip_json_fences(raw)
    if not cleaned:
        return None
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None


def graph_metrics(entities, edges):
    """Compute graph quality metrics. Inputs are lists from parse_graph_full."""
    if entities is None or edges is None:
        return None
    n_entities = len(entities)
    n_edges = len(edges)

    # Predicate diversity
    predicates = set()
    for e in edges:
        if isinstance(e, dict):
            p = e.get("predicate")
            if p:
                predicates.add(str(p).strip().lower())
    predicate_diversity = len(predicates)

    # Entity type diversity
    types = set()
    for ent in entities:
        if isinstance(ent, dict):
            t = ent.get("type")
            if t:
                types.add(str(t).strip().lower())
    type_diversity = len(types)

    # Average degree (edges*2 / entities — each edge touches two nodes)
    avg_degree = (2 * n_edges / n_entities) if n_entities > 0 else 0.0

    # Largest connected component
    # Build adjacency from edges
    entity_names = set()
    for ent in entities:
        if isinstance(ent, dict):
            n = ent.get("name")
            if n:
                entity_names.add(str(n).strip().lower())

    adj = {name: set() for name in entity_names}
    for e in edges:
        if not isinstance(e, dict):
            continue
        s = str(e.get("subject", "")).strip().lower()
        o = str(e.get("object", "")).strip().lower()
        if s in adj and o in adj:
            adj[s].add(o)
            adj[o].add(s)

    # Iterative DFS (explicit stack) for largest component
    visited = set()
    largest = 0
    for start in adj:
        if start in visited:
            continue
        component = 0
        stack = [start]
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            component += 1
            for neighbor in adj[node]:
                if neighbor not in visited:
                    stack.append(neighbor)
        if component > largest:
            largest = component

    return {
        "n_entities": n_entities,
        "n_edges": n_edges,
        "predicate_diversity": predicate_diversity,
        "type_diversity": type_diversity,
        "avg_degree": round(avg_degree, 2),
        "largest_component": largest,
        "largest_component_pct": round(100 * largest / n_entities, 1) if n_entities else 0.0,
    }


def stratify(docs):
    sized = [(d, d["content_length"]) for d in docs]
    small = [d for d, n in sized if n < 1000]
    medium = [d for d, n in sized if 1000 <= n < 5000]
    large = [d for d, n in sized if n >= 5000]
    return small[:5] + medium[:10] + large[:5]


def fmt_metrics(m):
    if m is None:
        return "n/a"
    return (f"e={m['n_entities']} edge={m['n_edges']} "
            f"pred={m['predicate_diversity']} type={m['type_diversity']} "
            f"deg={m['avg_degree']} comp={m['largest_component']}/{m['n_entities']}")


def main():
    api_key = os.environ.get("ANTHROPIC_API_KEY")
    pg_dsn = os.environ.get("PG_DSN")
    if not api_key or not pg_dsn:
        print("ERROR: ANTHROPIC_API_KEY or PG_DSN not set", file=sys.stderr)
        sys.exit(1)

    if not V2_FILE.exists():
        print(f"ERROR: {V2_FILE} not found", file=sys.stderr)
        sys.exit(1)

    with open(V2_FILE) as f:
        v2 = json.load(f)

    docs_meta = [d for d in v2["documents"] if d.get("status") == "SUCCESS"]
    sample = stratify(docs_meta)
    print(f"Sample: {len(sample)} docs (5s/10m/5l, file order)")
    print(f"Mistral context: 12288 tokens, doc cap {MAX_DOC_CHARS} chars")
    print(f"Haiku model: {HAIKU_MODEL} temp={HAIKU_TEMPERATURE}")
    print(f"Test: base-class metadata as orienting frame, NOT entity drafting")
    print()

    client = anthropic.Anthropic(api_key=api_key)
    pg_conn = psycopg2.connect(pg_dsn)

    results = []
    started_at = datetime.now(timezone.utc).isoformat()
    t_total = time.time()

    for i, doc_meta in enumerate(sample, 1):
        source = doc_meta["source"]
        doc_text, original_len = fetch_document_text(pg_conn, source)
        if not doc_text:
            print(f"[{i:02d}/{len(sample)}] {source[:55]} — SKIP (not in pgvector)")
            results.append({"source": source, "skipped": "not_in_pgvector"})
            continue

        sent_len = len(doc_text)
        truncated = original_len > sent_len
        size_bucket = (
            "small" if sent_len < 1000
            else "medium" if sent_len < 5000
            else "large"
        )
        trunc_marker = "*" if truncated else " "
        print(f"[{i:02d}/{len(sample)}] [{size_bucket:6s}] [{sent_len:>5}c{trunc_marker}] {source[:55]}", flush=True)

        # Condition A
        try:
            a = call_haiku(client, CONDITION_A_PROMPT + doc_text)
            a_ents, a_edges, a_ok = parse_graph_full(a["response_text"])
            a_metrics = graph_metrics(a_ents, a_edges) if a_ok else None
            print(f" A: in={a['input_tokens']} out={a['output_tokens']} "
                  f"stop={a['stop_reason']} t={a['latency_s']}s", flush=True)
            print(f" {fmt_metrics(a_metrics)}", flush=True)
        except Exception as e:
            print(f" A FAILED: {e}", flush=True)
            a = {"error": str(e)}
            a_metrics = None

        # Condition B local metadata pass
        local_result = call_local_metadata(doc_text)
        if "error" in local_result:
            print(f" B local FAILED: {local_result['error']}", flush=True)
            results.append({
                "source": source,
                "size_bucket": size_bucket,
                "doc_chars_original": original_len,
                "doc_chars_sent": sent_len,
                "truncated": truncated,
                "condition_a": {
                    "input_tokens": a.get("input_tokens"),
                    "output_tokens": a.get("output_tokens"),
                    "latency_s": a.get("latency_s"),
                    "metrics": a_metrics,
                    "stop_reason": a.get("stop_reason"),
                    "response_text": a.get("response_text", "")[:4000],
                    "error": a.get("error"),
                },
                "condition_b": {
                    "skipped": "local_model_failed",
                    "local_error": local_result["error"],
                    "local_latency_s": local_result.get("latency_s"),
                },
            })
            continue

local_raw = local_result["response"]
|
||||
metadata = parse_metadata(local_raw)
|
||||
print(f" B local: t={local_result['latency_s']}s metadata_parsed={metadata is not None}",
|
||||
flush=True)
|
||||
|
||||
if metadata is None:
|
||||
print(f" B: metadata parse failed — skipping API call", flush=True)
|
||||
results.append({
|
||||
"source": source,
|
||||
"size_bucket": size_bucket,
|
||||
"doc_chars_original": original_len,
|
||||
"doc_chars_sent": sent_len,
|
||||
"truncated": truncated,
|
||||
"condition_a": {
|
||||
"input_tokens": a.get("input_tokens"),
|
||||
"output_tokens": a.get("output_tokens"),
|
||||
"latency_s": a.get("latency_s"),
|
||||
"metrics": a_metrics,
|
||||
"stop_reason": a.get("stop_reason"),
|
||||
"response_text": a.get("response_text", "")[:4000],
|
||||
"error": a.get("error"),
|
||||
},
|
||||
"condition_b": {
|
||||
"skipped": "metadata_parse_failed",
|
||||
"local_latency_s": local_result.get("latency_s"),
|
||||
"local_raw": local_raw[:1000],
|
||||
},
|
||||
})
|
||||
continue
|
||||
|
||||
metadata_json = json.dumps(metadata, ensure_ascii=False, indent=2)
|
||||
b_prompt = CONDITION_B_API_PROMPT.replace("{metadata_json}", metadata_json) + doc_text
|
||||
|
||||
try:
|
||||
b = call_haiku(client, b_prompt)
|
||||
b_ents, b_edges, b_ok = parse_graph_full(b["response_text"])
|
||||
b_metrics = graph_metrics(b_ents, b_edges) if b_ok else None
|
||||
print(f" B api: in={b['input_tokens']} out={b['output_tokens']} "
|
||||
f"stop={b['stop_reason']} t={b['latency_s']}s", flush=True)
|
||||
print(f" {fmt_metrics(b_metrics)}", flush=True)
|
||||
except Exception as e:
|
||||
print(f" B api FAILED: {e}", flush=True)
|
||||
b = {"error": str(e)}
|
||||
b_metrics = None
|
||||
|
||||
# Per-doc deltas
|
||||
if "input_tokens" in a and "input_tokens" in b:
|
||||
in_pct = (b["input_tokens"] - a["input_tokens"]) / a["input_tokens"] * 100 if a["input_tokens"] else 0.0
|
||||
out_pct = (b["output_tokens"] - a["output_tokens"]) / a["output_tokens"] * 100 if a["output_tokens"] else 0.0
|
||||
edge_pct_str = "n/a"
|
||||
pred_pct_str = "n/a"
|
||||
if a_metrics and b_metrics:
|
||||
if a_metrics["n_edges"] > 0:
|
||||
edge_pct_str = f"{(b_metrics['n_edges'] - a_metrics['n_edges']) / a_metrics['n_edges'] * 100:+.1f}%"
|
||||
if a_metrics["predicate_diversity"] > 0:
|
||||
pred_pct_str = f"{(b_metrics['predicate_diversity'] - a_metrics['predicate_diversity']) / a_metrics['predicate_diversity'] * 100:+.1f}%"
|
||||
print(f" Δ in={in_pct:+.1f}% out={out_pct:+.1f}% edges={edge_pct_str} pred={pred_pct_str}",
|
||||
flush=True)
|
||||
|
||||
results.append({
|
||||
"source": source,
|
||||
"size_bucket": size_bucket,
|
||||
"doc_chars_original": original_len,
|
||||
"doc_chars_sent": sent_len,
|
||||
"truncated": truncated,
|
||||
"condition_a": {
|
||||
"input_tokens": a.get("input_tokens"),
|
||||
"output_tokens": a.get("output_tokens"),
|
||||
"latency_s": a.get("latency_s"),
|
||||
"metrics": a_metrics,
|
||||
"stop_reason": a.get("stop_reason"),
|
||||
"response_text": a.get("response_text", "")[:4000],
|
||||
"error": a.get("error"),
|
||||
},
|
||||
"condition_b": {
|
||||
"local_latency_s": local_result.get("latency_s"),
|
||||
"local_metadata": metadata,
|
||||
"local_raw": local_raw[:1000],
|
||||
"api_input_tokens": b.get("input_tokens"),
|
||||
"api_output_tokens": b.get("output_tokens"),
|
||||
"api_latency_s": b.get("latency_s"),
|
||||
"metrics": b_metrics,
|
||||
"stop_reason": b.get("stop_reason"),
|
||||
"response_text": b.get("response_text", "")[:4000],
|
||||
"error": b.get("error"),
|
||||
},
|
||||
})
|
||||
|
||||
pg_conn.close()
|
||||
total_elapsed = round(time.time() - t_total, 1)
|
||||
|
||||
valid = [r for r in results
|
||||
if r.get("condition_a", {}).get("metrics") is not None
|
||||
and r.get("condition_b", {}).get("metrics") is not None]
|
||||
|
||||
a_in = sum(r["condition_a"]["input_tokens"] for r in valid)
|
||||
a_out = sum(r["condition_a"]["output_tokens"] for r in valid)
|
||||
b_in = sum(r["condition_b"]["api_input_tokens"] for r in valid)
|
||||
b_out = sum(r["condition_b"]["api_output_tokens"] for r in valid)
|
||||
a_cost = (a_in * HAIKU_IN_PER_M + a_out * HAIKU_OUT_PER_M) / 1_000_000
|
||||
b_cost = (b_in * HAIKU_IN_PER_M + b_out * HAIKU_OUT_PER_M) / 1_000_000
|
||||
|
||||
def avg_metric(rows, condition, key):
|
||||
vals = [r[condition]["metrics"][key] for r in rows if r[condition]["metrics"]]
|
||||
return round(statistics.mean(vals), 2) if vals else None
|
||||
|
||||
by_bucket = {}
|
||||
for bucket in ("small", "medium", "large"):
|
||||
rows = [r for r in valid if r["size_bucket"] == bucket]
|
||||
if not rows:
|
||||
by_bucket[bucket] = None
|
||||
continue
|
||||
ai = sum(r["condition_a"]["input_tokens"] for r in rows)
|
||||
ao = sum(r["condition_a"]["output_tokens"] for r in rows)
|
||||
bi = sum(r["condition_b"]["api_input_tokens"] for r in rows)
|
||||
bo = sum(r["condition_b"]["api_output_tokens"] for r in rows)
|
||||
by_bucket[bucket] = {
|
||||
"n": len(rows),
|
||||
"input_delta_pct": round((bi - ai) / ai * 100, 2) if ai else None,
|
||||
"output_delta_pct": round((bo - ao) / ao * 100, 2) if ao else None,
|
||||
"a_avg_entities": avg_metric(rows, "condition_a", "n_entities"),
|
||||
"b_avg_entities": avg_metric(rows, "condition_b", "n_entities"),
|
||||
"a_avg_edges": avg_metric(rows, "condition_a", "n_edges"),
|
||||
"b_avg_edges": avg_metric(rows, "condition_b", "n_edges"),
|
||||
"a_avg_predicate_diversity": avg_metric(rows, "condition_a", "predicate_diversity"),
|
||||
"b_avg_predicate_diversity": avg_metric(rows, "condition_b", "predicate_diversity"),
|
||||
"a_avg_type_diversity": avg_metric(rows, "condition_a", "type_diversity"),
|
||||
"b_avg_type_diversity": avg_metric(rows, "condition_b", "type_diversity"),
|
||||
"a_avg_degree": avg_metric(rows, "condition_a", "avg_degree"),
|
||||
"b_avg_degree": avg_metric(rows, "condition_b", "avg_degree"),
|
||||
"a_avg_largest_component_pct": avg_metric(rows, "condition_a", "largest_component_pct"),
|
||||
"b_avg_largest_component_pct": avg_metric(rows, "condition_b", "largest_component_pct"),
|
||||
}
|
||||
|
||||
summary = {
|
||||
"experiment": "base_class_test",
|
||||
"title": "Base-Class Enrichment — OOP Framing",
|
||||
"started_at": started_at,
|
||||
"completed_at": datetime.now(timezone.utc).isoformat(),
|
||||
"haiku_model": HAIKU_MODEL,
|
||||
"local_model": LOCAL_MODEL,
|
||||
"max_doc_chars": MAX_DOC_CHARS,
|
||||
"n_documents": len(sample),
|
||||
"n_valid_pairs": len(valid),
|
||||
"total_elapsed_s": total_elapsed,
|
||||
"totals": {
|
||||
"a_input_tokens": a_in,
|
||||
"a_output_tokens": a_out,
|
||||
"b_input_tokens": b_in,
|
||||
"b_output_tokens": b_out,
|
||||
"a_cost_usd": round(a_cost, 4),
|
||||
"b_cost_usd": round(b_cost, 4),
|
||||
"cost_delta_usd": round(b_cost - a_cost, 4),
|
||||
"cost_delta_pct": round((b_cost - a_cost) / a_cost * 100, 2) if a_cost else None,
|
||||
"note": "API cost only — local Mistral runtime on VPS not monetized",
|
||||
},
|
||||
"by_size_bucket": by_bucket,
|
||||
"results": results,
|
||||
}
|
||||
|
||||
OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(OUTPUT_FILE, "w") as f:
|
||||
json.dump(summary, f, indent=2)
|
||||
|
||||
print()
|
||||
print("=" * 60)
|
||||
print(f"DONE — {len(valid)}/{len(sample)} valid pairs in {total_elapsed}s")
|
||||
print(f"A total cost: ${a_cost:.4f} (in={a_in} out={a_out})")
|
||||
print(f"B total cost: ${b_cost:.4f} (in={b_in} out={b_out})")
|
||||
delta_pct = summary['totals']['cost_delta_pct']
|
||||
if delta_pct is not None:
|
||||
verdict = "B cheaper" if delta_pct < 0 else "B more expensive"
|
||||
print(f"Cost delta: {delta_pct:+.2f}% ({verdict})")
|
||||
print()
|
||||
print("By bucket — graph metrics (A vs B):")
|
||||
for bucket, stats in by_bucket.items():
|
||||
if stats:
|
||||
print(f" {bucket:6s} (n={stats['n']}):")
|
||||
print(f" cost: in {stats['input_delta_pct']:+.1f}% out {stats['output_delta_pct']:+.1f}%")
|
||||
print(f" entities: A={stats['a_avg_entities']} B={stats['b_avg_entities']}")
|
||||
print(f" edges: A={stats['a_avg_edges']} B={stats['b_avg_edges']}")
|
||||
print(f" predicate diversity: A={stats['a_avg_predicate_diversity']} B={stats['b_avg_predicate_diversity']}")
|
||||
print(f" type diversity: A={stats['a_avg_type_diversity']} B={stats['b_avg_type_diversity']}")
|
||||
print(f" avg degree: A={stats['a_avg_degree']} B={stats['b_avg_degree']}")
|
||||
print(f" largest component %: A={stats['a_avg_largest_component_pct']} B={stats['b_avg_largest_component_pct']}")
|
||||
print()
|
||||
print(f"Results: {OUTPUT_FILE}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
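The per-document and per-bucket deltas in the script above all follow one pattern: `(B - A) / A * 100`, guarded against a zero baseline. A minimal sketch of that pattern follows. The helper names `pct_delta` and `fmt_pct` are hypothetical and do not appear in the original. Returning `None` for a zero baseline is one possible convention; the script itself uses `0.0` in some places and `None` in others.

```python
from typing import Optional


def pct_delta(a: float, b: float, ndigits: int = 2) -> Optional[float]:
    """Percent change from baseline a to b, or None when a is zero."""
    if not a:
        # A zero (or missing) baseline has no meaningful percent change.
        return None
    return round((b - a) / a * 100, ndigits)


def fmt_pct(value: Optional[float]) -> str:
    """Format a delta for log output, tolerating a missing value."""
    return "n/a" if value is None else f"{value:+.1f}%"


# Illustrative token counts, not measured results.
print(fmt_pct(pct_delta(1000, 1150)))
print(fmt_pct(pct_delta(0, 40)))
```

Formatting through a helper like `fmt_pct` also avoids applying a `:+.1f` format spec directly to a value that may be `None`.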
@@ -0,0 +1,611 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Base-Class Enrichment Test: OOP Framing Experiment

Tests whether non-entity metadata from a local model (domain class, structural
signals, presence flags, length, summary) can take load off the API without
constraining what it extracts.

The local model does NOT draft entities. The API still does full extraction.
The local model produces metadata that orients the API's reading.

Conditions:
  A: Baseline. Single Claude Haiku call, full extraction, no metadata.
  B: Base-class. Mistral metadata + Haiku full extraction with metadata as frame.

Critical test: B's edge count and predicate diversity must be ≥ A's, or close.
If B produces fewer edges or less predicate diversity, the metadata is acting
as a constraint and the OOP framing is falsified.

Sample: 50 docs:
  - 15 small (<1000 chars), from briefing_test_v2_results.json
  - 25 medium (1000-5000 chars), from briefing_test_v2_results.json
  - 10 large (5000-12000 chars, capped at 12K), from large_bucket_sources.json

Outputs: ~/aaronai/experiments/base_class_validation_results.json
|
||||
"""
|
||||
|
||||
import json
import os
import re
import statistics
import sys
import time
from datetime import datetime, timezone
from pathlib import Path

import anthropic
import psycopg2
import requests
from dotenv import load_dotenv

load_dotenv(Path.home() / "aaronai" / ".env")

V2_FILE = Path.home() / "aaronai" / "briefing_test_v2_results.json"
OUTPUT_FILE = Path.home() / "aaronai" / "experiments" / "base_class_validation_results.json"
HAIKU_MODEL = "claude-haiku-4-5-20251001"
HAIKU_MAX_TOKENS = 8192
HAIKU_TEMPERATURE = 0.0
OLLAMA_URL = "http://localhost:11434/api/generate"
LOCAL_MODEL = "mistral"
LOCAL_TIMEOUT = 180
MAX_DOC_CHARS = 12000

# API prices in USD per million tokens (used for the cost totals in main()).
HAIKU_IN_PER_M = 1.0
HAIKU_OUT_PER_M = 5.0
|
||||
|
||||
|
||||
CONDITION_A_PROMPT = """Extract a knowledge graph from the document below.
|
||||
|
||||
Return ONLY valid JSON with this exact schema:
|
||||
{
|
||||
"entities": [
|
||||
{"name": string, "type": string}
|
||||
],
|
||||
"edges": [
|
||||
{"subject": string, "predicate": string, "object": string}
|
||||
]
|
||||
}
|
||||
|
||||
Entity types: use whatever fits the entity. Do not constrain yourself to a fixed list.
|
||||
|
||||
Edge predicates: natural language phrases that capture the actual relationship the document states or implies.
|
||||
|
||||
Extract every entity and every relationship the document states or strongly implies. Both subject and object in every edge must appear in entities. JSON only, no commentary, no markdown fences.
|
||||
|
||||
DOCUMENT:
|
||||
"""
|
||||
|
||||
LOCAL_METADATA_PROMPT = """Analyze the document below and produce metadata describing its surface features. Do NOT extract entities. Do NOT identify content. Only produce structural and surface-level metadata.
|
||||
|
||||
Return ONLY valid JSON with this exact schema:
|
||||
{
|
||||
"language": "en or other",
|
||||
"char_length": integer,
|
||||
"primary_format": "prose, presentation, list, form, code, or mixed",
|
||||
"structural_signals": {
|
||||
"has_headings": boolean,
|
||||
"has_bullet_lists": boolean,
|
||||
"has_numbered_lists": boolean,
|
||||
"has_tables": boolean,
|
||||
"has_code_blocks": boolean,
|
||||
"has_dates": boolean
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": boolean,
|
||||
"has_institutional_language": boolean,
|
||||
"has_technical_terminology": boolean,
|
||||
"has_first_person": boolean,
|
||||
"has_quotations": boolean
|
||||
},
|
||||
"domain_class": "technical, administrative, personal, educational, creative, reference, or mixed",
|
||||
"one_sentence_summary": "string of 25 words or fewer describing what the document is about"
|
||||
}
|
||||
|
||||
JSON only, no commentary.
|
||||
|
||||
DOCUMENT:
|
||||
"""
|
||||
|
||||
CONDITION_B_API_PROMPT = """You are extracting a knowledge graph from a document. The document has been pre-analyzed by a local model and the following metadata is provided as orienting context — not as constraint. Extract every entity and every relationship in the document. Do not limit your extraction to what the metadata suggests; the metadata is here to orient your reading, not to bound it.
|
||||
|
||||
DOCUMENT METADATA:
|
||||
{metadata_json}
|
||||
|
||||
Return ONLY valid JSON with this exact schema:
|
||||
{
|
||||
"entities": [
|
||||
{"name": string, "type": string}
|
||||
],
|
||||
"edges": [
|
||||
{"subject": string, "predicate": string, "object": string}
|
||||
]
|
||||
}
|
||||
|
||||
Entity types: use whatever fits. Edge predicates: natural language phrases capturing the actual relationship. Both subject and object in every edge must appear in entities. Extract every entity and every relationship the document states or strongly implies. Do not filter for salience. JSON only, no commentary, no markdown fences.
|
||||
|
||||
DOCUMENT:
|
||||
"""
|
||||
|
||||
|
||||
def strip_json_fences(text):
    if not text:
        return ""
    t = text.strip()
    t = re.sub(r"^```(?:json)?\s*", "", t)
    t = re.sub(r"\s*```$", "", t)
    return t.strip()
|
||||
|
||||
|
||||
def fetch_document_text(pg_conn, source):
    """Return (text capped at MAX_DOC_CHARS, full length), or (None, 0) if absent."""
    # The context manager closes the cursor even if the query raises.
    with pg_conn.cursor() as cur:
        cur.execute(
            "SELECT document FROM embeddings WHERE source = %s ORDER BY id",
            (source,),
        )
        rows = cur.fetchall()
    if not rows:
        return None, 0
    # Rejoin the rows stored for this source, in id order.
    full = "\n\n".join(r[0] for r in rows)
    return full[:MAX_DOC_CHARS], len(full)
|
||||
|
||||
|
||||
def call_haiku(client, prompt_text):
    t0 = time.time()
    resp = client.messages.create(
        model=HAIKU_MODEL,
        max_tokens=HAIKU_MAX_TOKENS,
        temperature=HAIKU_TEMPERATURE,
        messages=[{"role": "user", "content": prompt_text}],
    )
    return {
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
        "latency_s": round(time.time() - t0, 2),
        "response_text": resp.content[0].text if resp.content else "",
        "stop_reason": resp.stop_reason,
    }
|
||||
|
||||
|
||||
def call_local_metadata(document_text):
    t0 = time.time()
    try:
        resp = requests.post(
            OLLAMA_URL,
            json={
                "model": LOCAL_MODEL,
                "prompt": LOCAL_METADATA_PROMPT + document_text,
                "stream": False,
                "format": "json",
                "options": {"num_predict": 1024, "temperature": 0, "num_ctx": 12288},
            },
            timeout=LOCAL_TIMEOUT,
        )
        resp.raise_for_status()
        return {
            "response": resp.json().get("response", ""),
            "latency_s": round(time.time() - t0, 2),
        }
    except Exception as e:
        return {"error": str(e), "latency_s": round(time.time() - t0, 2)}
|
||||
|
||||
|
||||
def parse_graph_full(raw):
    """Return (entities_list, edges_list, parsed_ok). Lists for metric computation."""
    cleaned = strip_json_fences(raw)
    if not cleaned:
        return None, None, False
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None, None, False
    if not isinstance(data, dict):
        return None, None, False
    ents = data.get("entities")
    edges = data.get("edges")
    if isinstance(ents, list) and isinstance(edges, list):
        return ents, edges, True
    return None, None, False
|
||||
|
||||
|
||||
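def parse_metadata(raw):
    """Return the metadata dict, or None if the response is not a JSON object."""
    cleaned = strip_json_fences(raw)
    if not cleaned:
        return None
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    # The prompt asks for a JSON object; treat any other JSON value as a parse failure.
    return data if isinstance(data, dict) else None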
|
||||
|
||||
|
||||
def graph_metrics(entities, edges):
    """Compute graph quality metrics. Inputs are lists from parse_graph_full."""
    if entities is None or edges is None:
        return None
    n_entities = len(entities)
    n_edges = len(edges)

    # Predicate diversity
    predicates = set()
    for e in edges:
        if isinstance(e, dict):
            p = e.get("predicate")
            if p:
                predicates.add(str(p).strip().lower())
    predicate_diversity = len(predicates)

    # Entity type diversity
    types = set()
    for ent in entities:
        if isinstance(ent, dict):
            t = ent.get("type")
            if t:
                types.add(str(t).strip().lower())
    type_diversity = len(types)

    # Average degree (edges*2 / entities; each edge touches two nodes)
    avg_degree = (2 * n_edges / n_entities) if n_entities > 0 else 0.0

    # Largest connected component
    # Build adjacency from edges
    entity_names = set()
    for ent in entities:
        if isinstance(ent, dict):
            n = ent.get("name")
            if n:
                entity_names.add(str(n).strip().lower())

    adj = {name: set() for name in entity_names}
    for e in edges:
        if not isinstance(e, dict):
            continue
        s = str(e.get("subject", "")).strip().lower()
        o = str(e.get("object", "")).strip().lower()
        # Edges whose endpoints are not declared entities are ignored here.
        if s in adj and o in adj:
            adj[s].add(o)
            adj[o].add(s)

    # Iterative depth-first traversal (explicit stack) for largest component
    visited = set()
    largest = 0
    for start in adj:
        if start in visited:
            continue
        component = 0
        stack = [start]
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            component += 1
            for neighbor in adj[node]:
                if neighbor not in visited:
                    stack.append(neighbor)
        if component > largest:
            largest = component

    return {
        "n_entities": n_entities,
        "n_edges": n_edges,
        "predicate_diversity": predicate_diversity,
        "type_diversity": type_diversity,
        "avg_degree": round(avg_degree, 2),
        "largest_component": largest,
        "largest_component_pct": round(100 * largest / n_entities, 1) if n_entities else 0.0,
    }
|
||||
|
||||
|
||||
def stratify(docs):
    """Pick small + medium from v2; the large bucket is loaded separately from
    large_bucket_sources.json (sampled fresh from pgvector since v2 has no large docs)."""
    sized = [(d, d["content_length"]) for d in docs]
    small = [d for d, n in sized if n < 1000][:15]
    medium = [d for d, n in sized if 1000 <= n < 5000][:25]

    # Load large bucket from external sources file
    large_sources_file = Path.home() / "aaronai" / "large_bucket_sources.json"
    if large_sources_file.exists():
        large_source_names = json.loads(large_sources_file.read_text())
        # Synthesize doc_meta entries for the large sources
        large = [{"source": s, "content_length": 0, "status": "SUCCESS"}
                 for s in large_source_names]
        # Report the counts actually selected rather than the nominal 15/25/10.
        print(f"Stratify: {len(small)} small + {len(medium)} medium from v2, "
              f"{len(large)} large from large_bucket_sources.json")
    else:
        large = []
        print("WARN: large_bucket_sources.json not found, no large docs in sample")

    return small + medium + large
|
||||
|
||||
|
||||
def fmt_metrics(m):
    if m is None:
        return "n/a"
    return (f"e={m['n_entities']} edge={m['n_edges']} "
            f"pred={m['predicate_diversity']} type={m['type_diversity']} "
            f"deg={m['avg_degree']} comp={m['largest_component']}/{m['n_entities']}")
|
||||
|
||||
|
||||
def main():
|
||||
api_key = os.environ.get("ANTHROPIC_API_KEY")
|
||||
pg_dsn = os.environ.get("PG_DSN")
|
||||
if not api_key or not pg_dsn:
|
||||
print("ERROR: ANTHROPIC_API_KEY or PG_DSN not set", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
if not V2_FILE.exists():
|
||||
print(f"ERROR: {V2_FILE} not found", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
with open(V2_FILE) as f:
|
||||
v2 = json.load(f)
|
||||
|
||||
docs_meta = [d for d in v2["documents"] if d.get("status") == "SUCCESS"]
|
||||
sample = stratify(docs_meta)
|
||||
print(f"Sample: {len(sample)} docs (15s/25m/10l, file order)")
|
||||
print(f"Mistral context: 12288 tokens, doc cap {MAX_DOC_CHARS} chars")
|
||||
print(f"Haiku model: {HAIKU_MODEL} temp={HAIKU_TEMPERATURE}")
|
||||
print(f"Test: base-class metadata as orienting frame, NOT entity drafting")
|
||||
print()
|
||||
|
||||
client = anthropic.Anthropic(api_key=api_key)
|
||||
pg_conn = psycopg2.connect(pg_dsn)
|
||||
|
||||
results = []
|
||||
started_at = datetime.now(timezone.utc).isoformat()
|
||||
t_total = time.time()
|
||||
|
||||
for i, doc_meta in enumerate(sample, 1):
|
||||
source = doc_meta["source"]
|
||||
doc_text, original_len = fetch_document_text(pg_conn, source)
|
||||
if not doc_text:
|
||||
print(f"[{i:02d}/{len(sample)}] {source[:55]} — SKIP (not in pgvector)")
|
||||
results.append({"source": source, "skipped": "not_in_pgvector"})
|
||||
continue
|
||||
|
||||
sent_len = len(doc_text)
|
||||
truncated = original_len > sent_len
|
||||
size_bucket = (
|
||||
"small" if sent_len < 1000
|
||||
else "medium" if sent_len < 5000
|
||||
else "large"
|
||||
)
|
||||
trunc_marker = "*" if truncated else " "
|
||||
print(f"[{i:02d}/{len(sample)}] [{size_bucket:6s}] [{sent_len:>5}c{trunc_marker}] {source[:55]}", flush=True)
|
||||
|
||||
# Condition A
|
||||
try:
|
||||
a = call_haiku(client, CONDITION_A_PROMPT + doc_text)
|
||||
a_ents, a_edges, a_ok = parse_graph_full(a["response_text"])
|
||||
a_metrics = graph_metrics(a_ents, a_edges) if a_ok else None
|
||||
print(f" A: in={a['input_tokens']} out={a['output_tokens']} "
|
||||
f"stop={a['stop_reason']} t={a['latency_s']}s", flush=True)
|
||||
print(f" {fmt_metrics(a_metrics)}", flush=True)
|
||||
except Exception as e:
|
||||
print(f" A FAILED: {e}", flush=True)
|
||||
a = {"error": str(e)}
|
||||
a_metrics = None
|
||||
|
||||
# Condition B local metadata pass
|
||||
local_result = call_local_metadata(doc_text)
|
||||
if "error" in local_result:
|
||||
print(f" B local FAILED: {local_result['error']}", flush=True)
|
||||
results.append({
|
||||
"source": source,
|
||||
"size_bucket": size_bucket,
|
||||
"doc_chars_original": original_len,
|
||||
"doc_chars_sent": sent_len,
|
||||
"truncated": truncated,
|
||||
"condition_a": {
|
||||
"input_tokens": a.get("input_tokens"),
|
||||
"output_tokens": a.get("output_tokens"),
|
||||
"latency_s": a.get("latency_s"),
|
||||
"metrics": a_metrics,
|
||||
"stop_reason": a.get("stop_reason"),
|
||||
"response_text": a.get("response_text", "")[:32000],
|
||||
"error": a.get("error"),
|
||||
},
|
||||
"condition_b": {
|
||||
"skipped": "local_model_failed",
|
||||
"local_error": local_result["error"],
|
||||
"local_latency_s": local_result.get("latency_s"),
|
||||
},
|
||||
})
|
||||
continue
|
||||
|
||||
local_raw = local_result["response"]
|
||||
metadata = parse_metadata(local_raw)
|
||||
# Override LLM-hallucinated char_length with Python-computed truth
|
||||
if metadata is not None and isinstance(metadata, dict):
|
||||
metadata["char_length"] = len(doc_text)
|
||||
print(f" B local: t={local_result['latency_s']}s metadata_parsed={metadata is not None}",
|
||||
flush=True)
|
||||
|
||||
if metadata is None:
|
||||
print(f" B: metadata parse failed — skipping API call", flush=True)
|
||||
results.append({
|
||||
"source": source,
|
||||
"size_bucket": size_bucket,
|
||||
"doc_chars_original": original_len,
|
||||
"doc_chars_sent": sent_len,
|
||||
"truncated": truncated,
|
||||
"condition_a": {
|
||||
"input_tokens": a.get("input_tokens"),
|
||||
"output_tokens": a.get("output_tokens"),
|
||||
"latency_s": a.get("latency_s"),
|
||||
"metrics": a_metrics,
|
||||
"stop_reason": a.get("stop_reason"),
|
||||
"response_text": a.get("response_text", "")[:32000],
|
||||
"error": a.get("error"),
|
||||
},
|
||||
"condition_b": {
|
||||
"skipped": "metadata_parse_failed",
|
||||
"local_latency_s": local_result.get("latency_s"),
|
||||
"local_raw": local_raw[:1000],
|
||||
},
|
||||
})
|
||||
continue
|
||||
|
||||
metadata_json = json.dumps(metadata, ensure_ascii=False, indent=2)
|
||||
b_prompt = CONDITION_B_API_PROMPT.replace("{metadata_json}", metadata_json) + doc_text
|
||||
|
||||
try:
|
||||
b = call_haiku(client, b_prompt)
|
||||
b_ents, b_edges, b_ok = parse_graph_full(b["response_text"])
|
||||
b_metrics = graph_metrics(b_ents, b_edges) if b_ok else None
|
||||
print(f" B api: in={b['input_tokens']} out={b['output_tokens']} "
|
||||
f"stop={b['stop_reason']} t={b['latency_s']}s", flush=True)
|
||||
print(f" {fmt_metrics(b_metrics)}", flush=True)
|
||||
except Exception as e:
|
||||
print(f" B api FAILED: {e}", flush=True)
|
||||
b = {"error": str(e)}
|
||||
b_metrics = None
|
||||
|
||||
# Per-doc deltas
|
||||
if "input_tokens" in a and "input_tokens" in b:
|
||||
in_pct = (b["input_tokens"] - a["input_tokens"]) / a["input_tokens"] * 100 if a["input_tokens"] else 0.0
|
||||
out_pct = (b["output_tokens"] - a["output_tokens"]) / a["output_tokens"] * 100 if a["output_tokens"] else 0.0
|
||||
edge_pct_str = "n/a"
|
||||
pred_pct_str = "n/a"
|
||||
if a_metrics and b_metrics:
|
||||
if a_metrics["n_edges"] > 0:
|
||||
edge_pct_str = f"{(b_metrics['n_edges'] - a_metrics['n_edges']) / a_metrics['n_edges'] * 100:+.1f}%"
|
||||
if a_metrics["predicate_diversity"] > 0:
|
||||
pred_pct_str = f"{(b_metrics['predicate_diversity'] - a_metrics['predicate_diversity']) / a_metrics['predicate_diversity'] * 100:+.1f}%"
|
||||
print(f" Δ in={in_pct:+.1f}% out={out_pct:+.1f}% edges={edge_pct_str} pred={pred_pct_str}",
|
||||
flush=True)
|
||||
|
||||
results.append({
|
||||
"source": source,
|
||||
"size_bucket": size_bucket,
|
||||
"doc_chars_original": original_len,
|
||||
"doc_chars_sent": sent_len,
|
||||
"truncated": truncated,
|
||||
"condition_a": {
|
||||
"input_tokens": a.get("input_tokens"),
|
||||
"output_tokens": a.get("output_tokens"),
|
||||
"latency_s": a.get("latency_s"),
|
||||
"metrics": a_metrics,
|
||||
"stop_reason": a.get("stop_reason"),
|
||||
"response_text": a.get("response_text", "")[:32000],
|
||||
"error": a.get("error"),
|
||||
},
|
||||
"condition_b": {
|
||||
"local_latency_s": local_result.get("latency_s"),
|
||||
"local_metadata": metadata,
|
||||
"local_raw": local_raw[:1000],
|
||||
"api_input_tokens": b.get("input_tokens"),
|
||||
"api_output_tokens": b.get("output_tokens"),
|
||||
"api_latency_s": b.get("latency_s"),
|
||||
"metrics": b_metrics,
|
||||
"stop_reason": b.get("stop_reason"),
|
||||
"response_text": b.get("response_text", "")[:32000],
|
||||
"error": b.get("error"),
|
||||
},
|
||||
})
|
||||
|
||||
pg_conn.close()
|
||||
total_elapsed = round(time.time() - t_total, 1)
|
||||
|
||||
valid = [r for r in results
|
||||
if r.get("condition_a", {}).get("metrics") is not None
|
||||
and r.get("condition_b", {}).get("metrics") is not None]
|
||||
|
||||
a_in = sum(r["condition_a"]["input_tokens"] for r in valid)
|
||||
a_out = sum(r["condition_a"]["output_tokens"] for r in valid)
|
||||
b_in = sum(r["condition_b"]["api_input_tokens"] for r in valid)
|
||||
b_out = sum(r["condition_b"]["api_output_tokens"] for r in valid)
|
||||
a_cost = (a_in * HAIKU_IN_PER_M + a_out * HAIKU_OUT_PER_M) / 1_000_000
|
||||
b_cost = (b_in * HAIKU_IN_PER_M + b_out * HAIKU_OUT_PER_M) / 1_000_000
|
||||
|
||||
def avg_metric(rows, condition, key):
|
||||
vals = [r[condition]["metrics"][key] for r in rows if r[condition]["metrics"]]
|
||||
return round(statistics.mean(vals), 2) if vals else None
|
||||
|
||||
by_bucket = {}
|
||||
for bucket in ("small", "medium", "large"):
|
||||
rows = [r for r in valid if r["size_bucket"] == bucket]
|
||||
if not rows:
|
||||
by_bucket[bucket] = None
|
||||
continue
|
||||
ai = sum(r["condition_a"]["input_tokens"] for r in rows)
|
||||
ao = sum(r["condition_a"]["output_tokens"] for r in rows)
|
||||
bi = sum(r["condition_b"]["api_input_tokens"] for r in rows)
|
||||
bo = sum(r["condition_b"]["api_output_tokens"] for r in rows)
|
||||
by_bucket[bucket] = {
|
||||
"n": len(rows),
|
||||
"input_delta_pct": round((bi - ai) / ai * 100, 2) if ai else None,
|
||||
"output_delta_pct": round((bo - ao) / ao * 100, 2) if ao else None,
|
||||
"a_avg_entities": avg_metric(rows, "condition_a", "n_entities"),
|
||||
"b_avg_entities": avg_metric(rows, "condition_b", "n_entities"),
|
||||
"a_avg_edges": avg_metric(rows, "condition_a", "n_edges"),
|
||||
"b_avg_edges": avg_metric(rows, "condition_b", "n_edges"),
|
||||
"a_avg_predicate_diversity": avg_metric(rows, "condition_a", "predicate_diversity"),
|
||||
"b_avg_predicate_diversity": avg_metric(rows, "condition_b", "predicate_diversity"),
|
||||
"a_avg_type_diversity": avg_metric(rows, "condition_a", "type_diversity"),
|
||||
"b_avg_type_diversity": avg_metric(rows, "condition_b", "type_diversity"),
|
            "a_avg_degree": avg_metric(rows, "condition_a", "avg_degree"),
            "b_avg_degree": avg_metric(rows, "condition_b", "avg_degree"),
            "a_avg_largest_component_pct": avg_metric(rows, "condition_a", "largest_component_pct"),
            "b_avg_largest_component_pct": avg_metric(rows, "condition_b", "largest_component_pct"),
        }

    summary = {
        "experiment": "base_class_test",
        "title": "Base-Class Enrichment — OOP Framing",
        "started_at": started_at,
        "completed_at": datetime.now(timezone.utc).isoformat(),
        "haiku_model": HAIKU_MODEL,
        "local_model": LOCAL_MODEL,
        "max_doc_chars": MAX_DOC_CHARS,
        "n_documents": len(sample),
        "n_valid_pairs": len(valid),
        "total_elapsed_s": total_elapsed,
        "totals": {
            "a_input_tokens": a_in,
            "a_output_tokens": a_out,
            "b_input_tokens": b_in,
            "b_output_tokens": b_out,
            "a_cost_usd": round(a_cost, 4),
            "b_cost_usd": round(b_cost, 4),
            "cost_delta_usd": round(b_cost - a_cost, 4),
            "cost_delta_pct": round((b_cost - a_cost) / a_cost * 100, 2) if a_cost else None,
            "note": "API cost only — local Mistral runtime on VPS not monetized",
        },
        "by_size_bucket": by_bucket,
        "results": results,
    }

    OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    with open(OUTPUT_FILE, "w") as f:
        json.dump(summary, f, indent=2)

    print()
    print("=" * 60)
    print(f"DONE — {len(valid)}/{len(sample)} valid pairs in {total_elapsed}s")
    print(f"A total cost: ${a_cost:.4f} (in={a_in} out={a_out})")
    print(f"B total cost: ${b_cost:.4f} (in={b_in} out={b_out})")
    delta_pct = summary['totals']['cost_delta_pct']
    if delta_pct is not None:
        verdict = "B cheaper" if delta_pct < 0 else "B more expensive"
        print(f"Cost delta: {delta_pct:+.2f}% ({verdict})")
    print()
    print("By bucket — graph metrics (A vs B):")
    for bucket, stats in by_bucket.items():
        if stats:
            print(f" {bucket:6s} (n={stats['n']}):")
            print(f" cost: in {stats['input_delta_pct']:+.1f}% out {stats['output_delta_pct']:+.1f}%")
            print(f" entities: A={stats['a_avg_entities']} B={stats['b_avg_entities']}")
            print(f" edges: A={stats['a_avg_edges']} B={stats['b_avg_edges']}")
            print(f" predicate diversity: A={stats['a_avg_predicate_diversity']} B={stats['b_avg_predicate_diversity']}")
            print(f" type diversity: A={stats['a_avg_type_diversity']} B={stats['b_avg_type_diversity']}")
            print(f" avg degree: A={stats['a_avg_degree']} B={stats['b_avg_degree']}")
            print(f" largest component %: A={stats['a_avg_largest_component_pct']} B={stats['b_avg_largest_component_pct']}")
    print()
    print(f"Results: {OUTPUT_FILE}")


if __name__ == "__main__":
    main()
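The cost totals reported above follow a simple per-million-token formula. The sketch below restates it in isolation. The prices (1.0 USD input, 5.0 USD output per million tokens) are assumptions borrowed from the `HAIKU_IN_PER_M` and `HAIKU_OUT_PER_M` constants in the cascade optimization script elsewhere in this diff, the helper names `api_cost_usd` and `cost_delta_pct` are hypothetical, and the token counts are made-up illustration values rather than experiment results.

```python
# Assumed prices in USD per million tokens (mirroring HAIKU_IN_PER_M / HAIKU_OUT_PER_M).
IN_PER_M = 1.0
OUT_PER_M = 5.0


def api_cost_usd(input_tokens, output_tokens, in_per_m=IN_PER_M, out_per_m=OUT_PER_M):
    """Return the API cost in USD for the given token counts."""
    return (input_tokens * in_per_m + output_tokens * out_per_m) / 1_000_000


def cost_delta_pct(a_cost, b_cost):
    """Percent change of condition B relative to condition A, or None if A is zero."""
    if not a_cost:
        return None
    return round((b_cost - a_cost) / a_cost * 100, 2)


# Illustrative token counts only (not measured data).
a_cost = api_cost_usd(40_000, 12_000)
b_cost = api_cost_usd(46_000, 10_000)
print(f"A=${a_cost:.4f} B=${b_cost:.4f} delta={cost_delta_pct(a_cost, b_cost):+.2f}%")
```

Because output tokens are priced higher than input tokens under these assumed rates, a condition that adds input tokens can still come out cheaper if it trims enough output.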
@@ -0,0 +1,376 @@
#!/usr/bin/env python3
"""
BirdAI Briefing Generator v2 — Experiment 002b
===============================================
Changes from v1 (based on Experiment 004 human evaluation):
- document_type now pre-classified by rule, not by model
- Capture template header stripped before model sees content
- noise_signals constrained to controlled vocabulary
- Model prompt simplified — focuses only on reliable signal fields
- Expanded document type vocabulary for BirdAI-specific types
Results written to ~/aaronai/briefing_test_v2_results.json
"""

import json
import os
import re
import urllib.request
import urllib.error
import psycopg2
import psycopg2.extras
import hashlib
import time
from datetime import datetime, timedelta
from dotenv import load_dotenv

load_dotenv(os.path.expanduser("~/aaronai/.env"))

PG_DSN = os.getenv("PG_DSN")
RESULTS_FILE = os.path.expanduser("~/aaronai/briefing_test_v2_results.json")
MODEL = "mistral"
SAMPLE_SIZE = 50
OLLAMA_URL = "http://localhost:11434/api/generate"

VALID_DOC_TYPES = {
    "voice_capture", "image_capture",
    "dream_nrem", "dream_rem", "dream_lucid", "dream_synthesis",
    "presentation", "code", "spreadsheet",
    "academic_pdf", "technical_doc", "chat_log",
    "book_excerpt", "form", "syllabus", "email",
    "notes", "purchase_order", "annual_report",
    "invoice", "memo", "report", "unknown"
}

VALID_DENSITIES = {"high", "medium", "low"}
VALID_PRIORITIES = {"full", "partial", "skip"}

VALID_NOISE_SIGNALS = {
    "repeated_headers", "page_numbers", "formatting_artifacts",
    "boilerplate", "watermarks", "footers", "line_numbers",
    "encoding_artifacts", "ocr_errors"
}

VALID_STRUCTURE_SIGNALS = {
    "headings", "bullet_lists", "numbered_lists", "tables",
    "code_blocks", "citations", "footnotes", "images",
    "forms", "columns", "sections"
}


def pre_classify_document(source, content):
    """Classify a document by filename/content rules and strip any template header.

    Returns (doc_type, cleaned_content); doc_type is None when no rule matches.
    """
    filename = os.path.basename(source).lower()
    doc_type = None
    cleaned_content = content

    # Drop the capture/dream template header (everything before the first "---").
    if "---" in content:
        parts = content.split("---", 1)
        header = parts[0].lower()
        body = parts[1].strip() if len(parts) > 1 else content
        if any(marker in header for marker in ["**type:**", "**modality:**", "# capture", "# dream"]):
            cleaned_content = body if body else content

    # Filename rules take precedence; content-header rules are the fallback.
    if "nrem" in filename:
        doc_type = "dream_nrem"
    elif "lucid" in filename:
        doc_type = "dream_lucid"
    elif "-rem-" in filename or filename.endswith("-rem.md"):
        doc_type = "dream_rem"
    elif "synthesis" in filename and filename.endswith(".md"):
        doc_type = "dream_synthesis"
    elif "-voice" in filename or "voice-" in filename:
        doc_type = "voice_capture"
    elif "-image" in filename or "image-" in filename:
        doc_type = "image_capture"
    elif filename.endswith(".pptx") or filename.endswith(".ppt"):
        doc_type = "presentation"
    elif filename.endswith(".xlsx") or filename.endswith(".xls") or filename.endswith(".csv"):
        doc_type = "spreadsheet"
    elif any(filename.endswith(ext) for ext in [".py", ".js", ".ts", ".cpp", ".c", ".h", ".java", ".rs"]):
        doc_type = "code"
    elif filename.endswith("cmakelists.txt") or filename == "makefile":
        doc_type = "code"
    elif content.startswith("# Dream"):
        if "nrem" in content[:50].lower():
            doc_type = "dream_nrem"
        elif "lucid" in content[:50].lower():
            doc_type = "dream_lucid"
        elif "rem" in content[:50].lower():
            doc_type = "dream_rem"
        else:
            doc_type = "dream_synthesis"
    elif content.startswith("# Capture"):
        doc_type = "voice_capture" if "voice" in content[:100].lower() else "image_capture"

    return doc_type, cleaned_content


def build_briefing_prompt(content, pre_classified_type=None):
    if pre_classified_type:
        type_instruction = f'\n "document_type": "{pre_classified_type}", // pre-classified, do not change'
    else:
        type_instruction = '\n "document_type": "one of: academic_pdf, technical_doc, chat_log, book_excerpt, form, syllabus, email, notes, purchase_order, annual_report, invoice, memo, report, unknown",'

    return f"""Analyze this document and return a JSON briefing. No explanation, no prose, JSON only.

Return exactly this structure:
{{{type_instruction}
"primary_language": "language code e.g. en, fr, de",
"density": "one of: high, medium, low",
"has_proper_nouns": true or false,
"has_dates": true or false,
"has_numeric_data": true or false,
"has_institutional_language": true or false,
"has_technical_terms": true or false,
"likely_has_named_entities": true or false,
"structure_signals": [],
"noise_signals": [],
"extraction_priority": "one of: full, partial, skip"
}}

Rules:
- density: high=information dense technical or academic, medium=mixed, low=narrative/literary/sparse/short
- has_proper_nouns: true if you see capitalized words that are NOT sentence starts or template headers
- has_dates: true if you see date patterns (numbers with months, years, slashes)
- has_numeric_data: true if you see measurements, percentages, statistics
- has_institutional_language: true if you see words like university, department, policy, committee, grant
- has_technical_terms: true if you see domain-specific jargon or acronyms
- likely_has_named_entities: true if has_proper_nouns is true
- structure_signals: use ONLY these terms: headings, bullet_lists, numbered_lists, tables, code_blocks, citations, footnotes, images, forms, columns, sections
- noise_signals: use ONLY these terms: repeated_headers, page_numbers, formatting_artifacts, boilerplate, watermarks, footers, line_numbers, encoding_artifacts, ocr_errors
- extraction_priority: full if density=high and likely_has_named_entities=true; skip if density=low AND likely_has_named_entities=false AND content is under 200 words; partial otherwise

Document:
{content[:1500]}"""


def get_sample_documents():
    if not PG_DSN:
        raise RuntimeError("PG_DSN not found in .env — cannot connect to database")
    conn = psycopg2.connect(PG_DSN)
    cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    cur.execute("""
        SELECT DISTINCT ON (source) id, document, source, created_at
        FROM embeddings
        WHERE length(document) > 100
        AND length(document) < 3000
        ORDER BY source, random()
        LIMIT %s
    """, (SAMPLE_SIZE,))
    docs = cur.fetchall()
    cur.close()
    conn.close()
    return docs


def run_briefing(prompt):
    """Send the prompt to Ollama; return (parsed_dict, raw_text) or (None, error_message)."""
    payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    raw = ""
    try:
        req = urllib.request.Request(OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req, timeout=180) as resp:
            result = json.loads(resp.read().decode())
        raw = result.get("response", "").strip()
        # Keep only the outermost {...} span so surrounding chatter is ignored.
        start = raw.find("{")
        end = raw.rfind("}") + 1
        if start == -1 or end == 0:
            return None, f"NO_JSON: {raw[:200]}"
        parsed = json.loads(raw[start:end])
        if not isinstance(parsed, dict):
            return None, f"NOT_DICT: {raw[:100]}"
        return parsed, raw
    except urllib.error.URLError as e:
        return None, f"URL_ERROR: {e}"
    except TimeoutError:
        return None, "TIMEOUT"
    except json.JSONDecodeError as e:
        return None, f"JSON_ERROR: {e} | raw: {raw[:200]}"
    except Exception as e:
        return None, f"ERROR: {type(e).__name__}: {e}"


def sanitize_briefing(briefing, pre_classified_type=None):
    safe = {}
    if pre_classified_type:
        safe["document_type"] = pre_classified_type
    else:
        dt = str(briefing.get("document_type", "unknown")).lower().strip()
        safe["document_type"] = dt if dt in VALID_DOC_TYPES else "unknown"
    safe["primary_language"] = str(briefing.get("primary_language", "en")).lower().strip()[:10]
    density = str(briefing.get("density", "medium")).lower().strip()
    safe["density"] = density if density in VALID_DENSITIES else "medium"
    for field in ["has_proper_nouns", "has_dates", "has_numeric_data",
                  "has_institutional_language", "has_technical_terms", "likely_has_named_entities"]:
        val = briefing.get(field, False)
        if isinstance(val, bool):
            safe[field] = val
        elif isinstance(val, str):
            safe[field] = val.lower() in ("true", "yes", "1")
        else:
            safe[field] = bool(val)
    for field, valid_set in [("structure_signals", VALID_STRUCTURE_SIGNALS),
                             ("noise_signals", VALID_NOISE_SIGNALS)]:
        val = briefing.get(field, [])
        if isinstance(val, list):
            safe[field] = [str(v).lower().strip() for v in val if str(v).lower().strip() in valid_set]
        elif isinstance(val, str) and val.lower().strip() in valid_set:
            safe[field] = [val.lower().strip()]
        else:
            safe[field] = []
    priority = str(briefing.get("extraction_priority", "partial")).lower().strip()
    safe["extraction_priority"] = priority if priority in VALID_PRIORITIES else "partial"
    return safe


def estimate_token_reduction(original_text, briefing):
    """Heuristic estimate: ~4 chars per token, a fixed 200 orientation tokens,
    and 5% of the text per noise signal (capped at 40%)."""
    original_tokens = max(len(original_text) / 4, 1)
    orientation_saved = 200
    if briefing.get("extraction_priority") == "skip":
        return {"original_tokens_approx": round(original_tokens),
                "orientation_tokens_saved": round(original_tokens + 200),
                "noise_reduction_pct": 100.0, "total_reduction_pct": 100.0,
                "note": "skip — no API call"}
    noise_count = len(briefing.get("noise_signals", []))
    noise_reduction_pct = min(noise_count * 0.05, 0.40)
    noise_tokens_saved = original_tokens * noise_reduction_pct
    total_saved = orientation_saved + noise_tokens_saved
    reduction_pct = min((total_saved / (original_tokens + 200)) * 100, 99.0)
    return {"original_tokens_approx": round(original_tokens),
            "orientation_tokens_saved": orientation_saved,
            "noise_tokens_saved": round(noise_tokens_saved),
            "noise_reduction_pct": round(noise_reduction_pct * 100, 1),
            "total_reduction_pct": round(reduction_pct, 1)}


def format_eta(elapsed_times, completed, total):
    if completed == 0:
        return "ETA: --:--"
    avg = sum(elapsed_times) / completed
    eta = timedelta(seconds=int((total - completed) * avg))
    return f"ETA: {str(eta)}"


def content_hash(text):
    return hashlib.md5(text.encode()).hexdigest()[:8]


def main():
    test_start = time.time()
    print(f"\nBirdAI Briefing Generator v2 — Experiment 002b")
    print(f"Model: {MODEL} | Sample: {SAMPLE_SIZE} docs (distinct sources)")
    print(f"Changes: rule-based doc_type, template stripping, controlled vocab")
    print(f"Started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Results: {RESULTS_FILE}")
    print("-" * 75)

    docs = get_sample_documents()
    print(f"Loaded {len(docs)} distinct source documents from pgvector\n")

    results = {
        "meta": {"model": MODEL, "version": "v2", "sample_size": len(docs),
                 "started": datetime.now().isoformat(), "completed": None,
                 "total_elapsed_seconds": None, "avg_seconds_per_doc": None},
        "documents": [], "summary": {}
    }

    success_count = 0
    failed_count = 0
    pre_classified_count = 0
    priority_counts = {"full": 0, "partial": 0, "skip": 0}
    total_reduction_pct = 0.0
    elapsed_times = []

    for i, doc in enumerate(docs):
        doc_id = doc["id"]
        content = doc["document"]
        # RealDictCursor always includes the key, so also guard against a NULL source.
        source = doc.get("source") or "unknown"
        chash = content_hash(content)

        pre_type, cleaned_content = pre_classify_document(source, content)
        was_pre_classified = pre_type is not None
        if was_pre_classified:
            pre_classified_count += 1

        eta_str = format_eta(elapsed_times, i, len(docs))
        pre_flag = "R" if was_pre_classified else "M"
        print(f"[{i+1:02d}/{len(docs)}][{pre_flag}] {source[:36]:<36} {eta_str:<14}", end=" ", flush=True)

        prompt = build_briefing_prompt(cleaned_content, pre_type)
        t_start = time.time()
        briefing, raw = run_briefing(prompt)
        elapsed = round(time.time() - t_start, 1)
        elapsed_times.append(elapsed)

        if briefing is None:
            failed_count += 1
            print(f"→ FAILED {elapsed}s | {raw[:50]}")
            results["documents"].append({
                "id": doc_id, "source": source, "content_hash": chash,
                "content_length": len(content), "status": "FAILED",
                "pre_classified_type": pre_type, "error": raw, "elapsed_seconds": elapsed
            })
        else:
            briefing = sanitize_briefing(briefing, pre_type)
            success_count += 1
            priority = briefing["extraction_priority"]
            doc_type = briefing["document_type"]
            density = briefing["density"]
            priority_counts[priority] = priority_counts.get(priority, 0) + 1
            reduction = estimate_token_reduction(cleaned_content, briefing)
            total_reduction_pct += reduction["total_reduction_pct"]
            print(f"→ {priority.upper():<7} {doc_type:<15} density:{density:<6} -{reduction['total_reduction_pct']:>5.1f}% {elapsed}s")
            results["documents"].append({
                "id": doc_id, "source": source, "content_hash": chash,
                "content_length": len(content), "cleaned_content_length": len(cleaned_content),
                "status": "SUCCESS", "pre_classified_type": pre_type,
                "was_pre_classified": was_pre_classified, "elapsed_seconds": elapsed,
                "briefing": briefing, "token_reduction_estimate": reduction
            })

        with open(RESULTS_FILE, "w") as f:
            json.dump(results, f, indent=2, default=str)

    total_elapsed = round(time.time() - test_start, 1)
    avg_per_doc = round(total_elapsed / len(docs), 1) if docs else 0
    completed_at = datetime.now().isoformat()
    results["meta"]["completed"] = completed_at
    results["meta"]["total_elapsed_seconds"] = total_elapsed
    results["meta"]["avg_seconds_per_doc"] = avg_per_doc

    total = len(docs)
    avg_reduction = round(total_reduction_pct / success_count, 1) if success_count else 0
    summary = {
        "total": total, "success": success_count, "failed": failed_count,
        # Guard against an empty sample (total == 0) to avoid ZeroDivisionError.
        "success_rate": round(success_count / total * 100, 1) if total else 0.0,
        "pre_classified_by_rule": pre_classified_count,
        "classified_by_model": total - pre_classified_count,
        "extraction_priority_breakdown": priority_counts,
        "avg_token_reduction_pct": avg_reduction,
        "total_elapsed_seconds": total_elapsed, "avg_seconds_per_doc": avg_per_doc,
        "projected_50_doc_minutes": round((avg_per_doc * 50) / 60, 1),
        "approach_viable": bool(total) and success_count / total >= 0.8
    }
    results["summary"] = summary
    with open(RESULTS_FILE, "w") as f:
        json.dump(results, f, indent=2, default=str)

    print("\n" + "=" * 75)
    print(f"RESULTS — Briefing Generator v2")
    print(f" Success rate: {success_count}/{total} ({summary['success_rate']}%)")
    print(f" Failed: {failed_count}")
    print(f" Pre-classified (rule): {pre_classified_count}")
    print(f" Classified (model): {total - pre_classified_count}")
    print(f" Priority — full: {priority_counts.get('full', 0)}")
    print(f" Priority — partial: {priority_counts.get('partial', 0)}")
    print(f" Priority — skip: {priority_counts.get('skip', 0)}")
    print(f" Avg token reduction: {avg_reduction}%")
    print(f" Total elapsed: {total_elapsed}s ({round(total_elapsed/60, 1)} min)")
    print(f" Avg per document: {avg_per_doc}s")
    print(f" Projected 50 docs: {summary['projected_50_doc_minutes']} min")
    print(f" Approach viable: {'YES' if summary['approach_viable'] else 'NO'}")
    print(f" Completed: {completed_at}")
    print(f" Full results: {RESULTS_FILE}")
    print("=" * 75)


if __name__ == "__main__":
    main()
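The `extraction_priority` rule in the briefing prompt depends only on `density`, `likely_has_named_entities`, and document length, so it could be computed in code in the same way `document_type` is pre-classified by rule. The following is a minimal sketch under that assumption. The helper name `derive_extraction_priority` and its `word_count` parameter are hypothetical and do not appear in the script above.

```python
def derive_extraction_priority(density, likely_has_named_entities, word_count):
    """Apply the prompt's stated rule deterministically (hypothetical helper).

    full    -> density is high and named entities are likely
    skip    -> density is low, no named entities likely, and under 200 words
    partial -> everything else
    """
    if density == "high" and likely_has_named_entities:
        return "full"
    if density == "low" and not likely_has_named_entities and word_count < 200:
        return "skip"
    return "partial"


# Example inputs; word_count would come from something like len(text.split()).
for args in [("high", True, 50), ("low", False, 120), ("low", False, 500), ("medium", True, 1000)]:
    print(args, "->", derive_extraction_priority(*args))
```

A rule like this could be used to cross-check the model's own `extraction_priority` value after `sanitize_briefing`, rather than to replace it outright.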
@@ -0,0 +1,508 @@
#!/usr/bin/env python3
"""
Cascade Optimization Test — skip-small + compressed-draft

Tests whether two optimizations on the entity-drafter cascade meaningfully
improve the savings ceiling beyond the prior unoptimized cascade (12.66%).

Optimizations:
A — Skip-small-docs routing: docs <1000 chars bypass the local pass entirely
B — Compressed draft format: bare JSON array instead of markdown bullets

Conditions:
A — Baseline: single Claude Haiku call, full extraction (unchanged from prior)
B — Optimized cascade: skip-small + compressed draft, otherwise same cascade

Sample: 30 docs from briefing_test_v2_results.json:
- 10 small (<1000 chars) — should show 0% delta if skip-small works
- 12 medium (1000-5000 chars) — primary test bucket
- 8 large (5000-12000 chars, capped at 12K)

Mistral context: 12K (raised from 8K in prior run).

Outputs: ~/aaronai/experiments/cascade_optimization_results.json
"""

import json
import os
import re
import statistics
import sys
import time
from datetime import datetime, timezone
from pathlib import Path

import anthropic
import psycopg2
import requests
from dotenv import load_dotenv

load_dotenv(Path.home() / "aaronai" / ".env")

V2_FILE = Path.home() / "aaronai" / "briefing_test_v2_results.json"
OUTPUT_FILE = Path.home() / "aaronai" / "experiments" / "cascade_optimization_results.json"
HAIKU_MODEL = "claude-haiku-4-5-20251001"
HAIKU_MAX_TOKENS = 4096
HAIKU_TEMPERATURE = 0.0
OLLAMA_URL = "http://localhost:11434/api/generate"
LOCAL_MODEL = "mistral"
LOCAL_TIMEOUT = 180 # raised — 12K context can take longer
MAX_DOC_CHARS = 12000 # raised from 8K
SKIP_SMALL_THRESHOLD = 1000

HAIKU_IN_PER_M = 1.0
HAIKU_OUT_PER_M = 5.0


CONDITION_A_PROMPT = """Extract a knowledge graph from the document below.

Return ONLY valid JSON with this exact schema:
{
"entities": [
{"name": string, "type": string}
],
"edges": [
{"subject": string, "predicate": string, "object": string}
]
}

Entity types: use whatever fits the entity. Do not constrain yourself to a fixed list.

Edge predicates: natural language phrases that capture the actual relationship the document states or implies.

Extract every entity and every relationship the document states or strongly implies. Both subject and object in every edge must appear in entities. JSON only, no commentary, no markdown fences.

DOCUMENT:
"""

LOCAL_PROMPT = """List every named entity that appears in the document below — every person, organization, place, project, document, material, technique, date, event, or other named thing.

Return ONLY valid JSON:
{
"candidates": [string]
}

Just names. No types, no relationships. JSON only.

DOCUMENT:
"""

# Compressed draft format — bare JSON array, minimal preamble
CONDITION_B_API_PROMPT_COMPRESSED = """Extract a knowledge graph from the document below.

Local model entity candidates (hint, not authoritative — verify against the document, ignore false ones, add missed ones):
{local_draft_json}

Return ONLY valid JSON with this exact schema:
{
"entities": [
{"name": string, "type": string}
],
"edges": [
{"subject": string, "predicate": string, "object": string}
]
}

Entity types: use whatever fits. Edge predicates: natural language phrases capturing the actual relationship. Both subject and object in every edge must appear in entities. Extract every entity and every relationship the document states or strongly implies. JSON only, no commentary, no markdown fences.

DOCUMENT:
"""


def strip_json_fences(text):
    if not text:
        return ""
    t = text.strip()
    t = re.sub(r"^```(?:json)?\s*", "", t)
    t = re.sub(r"\s*```$", "", t)
    return t.strip()


def fetch_document_text(pg_conn, source):
    cur = pg_conn.cursor()
    cur.execute(
        "SELECT document FROM embeddings WHERE source = %s ORDER BY id",
        (source,),
    )
    rows = cur.fetchall()
    cur.close()
    if not rows:
        return None, 0
    full = "\n\n".join(r[0] for r in rows)
    return full[:MAX_DOC_CHARS], len(full)


def call_haiku(client, prompt_text):
    t0 = time.time()
    resp = client.messages.create(
        model=HAIKU_MODEL,
        max_tokens=HAIKU_MAX_TOKENS,
        temperature=HAIKU_TEMPERATURE,
        messages=[{"role": "user", "content": prompt_text}],
    )
    return {
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
        "latency_s": round(time.time() - t0, 2),
        "response_text": resp.content[0].text if resp.content else "",
        "stop_reason": resp.stop_reason,
    }


def call_local(document_text):
    t0 = time.time()
    try:
        resp = requests.post(
            OLLAMA_URL,
            json={
                "model": LOCAL_MODEL,
                "prompt": LOCAL_PROMPT + document_text,
                "stream": False,
                "format": "json",
                "options": {"num_predict": 1024, "temperature": 0, "num_ctx": 12288},
            },
            timeout=LOCAL_TIMEOUT,
        )
        resp.raise_for_status()
        return {
            "response": resp.json().get("response", ""),
            "latency_s": round(time.time() - t0, 2),
        }
    except Exception as e:
        return {"error": str(e), "latency_s": round(time.time() - t0, 2)}


def parse_graph(raw):
    cleaned = strip_json_fences(raw)
    if not cleaned:
        return None, None
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None, None
    if not isinstance(data, dict):
        return None, None
    ents = data.get("entities")
    edges = data.get("edges")
    if isinstance(ents, list) and isinstance(edges, list):
        return len(ents), len(edges)
    return None, None


def parse_candidates(raw):
    cleaned = strip_json_fences(raw)
    if not cleaned:
        return None
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    cands = data.get("candidates")
    if isinstance(cands, list):
        return [str(c).strip() for c in cands if c]
    return None


def stratify(docs):
    """Pick 10 small / 12 medium / 8 large by character length, in file order.

    Note: content_length is the per-row length recorded by the v2 briefing run,
    while main() recomputes size_bucket from the concatenated text returned by
    fetch_document_text(), so the two bucketings can disagree.
    """
    sized = [(d, d["content_length"]) for d in docs]
    small = [d for d, n in sized if n < 1000]
    medium = [d for d, n in sized if 1000 <= n < 5000]
    large = [d for d, n in sized if n >= 5000]
    return small[:10] + medium[:12] + large[:8]


def main():
    api_key = os.environ.get("ANTHROPIC_API_KEY")
    pg_dsn = os.environ.get("PG_DSN")
    if not api_key or not pg_dsn:
        print("ERROR: ANTHROPIC_API_KEY or PG_DSN not set", file=sys.stderr)
        sys.exit(1)

    if not V2_FILE.exists():
        print(f"ERROR: {V2_FILE} not found", file=sys.stderr)
        sys.exit(1)

    with open(V2_FILE) as f:
        v2 = json.load(f)

    docs_meta = [d for d in v2["documents"] if d.get("status") == "SUCCESS"]
    sample = stratify(docs_meta)
    print(f"Sample: {len(sample)} docs (10s/12m/8l, file order)")
    print(f"Skip-small threshold: <{SKIP_SMALL_THRESHOLD} chars")
    print(f"Mistral context: 12288 tokens, doc cap {MAX_DOC_CHARS} chars")
    print(f"Haiku model: {HAIKU_MODEL} temp={HAIKU_TEMPERATURE} max_tokens={HAIKU_MAX_TOKENS}")
    print()

    client = anthropic.Anthropic(api_key=api_key)
    pg_conn = psycopg2.connect(pg_dsn)

    results = []
    started_at = datetime.now(timezone.utc).isoformat()
    t_total = time.time()

    for i, doc_meta in enumerate(sample, 1):
        source = doc_meta["source"]
        doc_text, original_len = fetch_document_text(pg_conn, source)
        if not doc_text:
            print(f"[{i:02d}/{len(sample)}] {source[:55]} — SKIP (not in pgvector)")
            results.append({"source": source, "skipped": "not_in_pgvector"})
            continue

        sent_len = len(doc_text)
        truncated = original_len > sent_len
        size_bucket = (
            "small" if sent_len < 1000
            else "medium" if sent_len < 5000
            else "large"
        )
        skip_small_routed = sent_len < SKIP_SMALL_THRESHOLD
        trunc_marker = "*" if truncated else " "
        route_marker = "[skip-small]" if skip_small_routed else "[cascade] "
        print(f"[{i:02d}/{len(sample)}] [{size_bucket:6s}] [{sent_len:>5}c{trunc_marker}] "
              f"{route_marker} {source[:50]}", flush=True)

        # Condition A — always runs
        try:
            a = call_haiku(client, CONDITION_A_PROMPT + doc_text)
            a_ents, a_edges = parse_graph(a["response_text"])
            print(f" A: in={a['input_tokens']} out={a['output_tokens']} "
                  f"ents={a_ents} edges={a_edges} stop={a['stop_reason']} t={a['latency_s']}s",
                  flush=True)
        except Exception as e:
            print(f" A FAILED: {e}", flush=True)
            a = {"error": str(e)}
            a_ents = a_edges = None

        # Condition B
        if skip_small_routed:
            # Skip-small: B = A. Same call, no local pass.
            print(f" B: routed to baseline (skip-small)", flush=True)
            b = a
            b_ents = a_ents
            b_edges = a_edges
            local_result = {"skipped": "skip_small_routed"}
            local_candidates = []
            local_raw = ""
        else:
            local_result = call_local(doc_text)
            if "error" in local_result:
                print(f" B local FAILED: {local_result['error']} — recording skip", flush=True)
                results.append({
                    "source": source,
                    "size_bucket": size_bucket,
                    "doc_chars_original": original_len,
                    "doc_chars_sent": sent_len,
                    "truncated": truncated,
                    "skip_small_routed": False,
                    "condition_a": {
                        "input_tokens": a.get("input_tokens"),
                        "output_tokens": a.get("output_tokens"),
                        "latency_s": a.get("latency_s"),
                        "entity_count": a_ents,
                        "edge_count": a_edges,
                        "stop_reason": a.get("stop_reason"),
                        "response_text": a.get("response_text", "")[:4000],
                        "error": a.get("error"),
                    },
                    "condition_b": {
                        "skipped": "local_model_failed",
                        "local_error": local_result["error"],
                        "local_latency_s": local_result.get("latency_s"),
                    },
                })
                continue

            local_raw = local_result["response"]
            cands = parse_candidates(local_raw)
            local_candidates = cands or []
            print(f" B local: t={local_result['latency_s']}s candidates={len(local_candidates)}",
                  flush=True)

            if not local_candidates:
                print(f" B local: empty draft — skipping API call", flush=True)
                results.append({
                    "source": source,
                    "size_bucket": size_bucket,
                    "doc_chars_original": original_len,
                    "doc_chars_sent": sent_len,
                    "truncated": truncated,
                    "skip_small_routed": False,
                    "condition_a": {
                        "input_tokens": a.get("input_tokens"),
                        "output_tokens": a.get("output_tokens"),
                        "latency_s": a.get("latency_s"),
                        "entity_count": a_ents,
                        "edge_count": a_edges,
                        "stop_reason": a.get("stop_reason"),
                        "response_text": a.get("response_text", "")[:4000],
                        "error": a.get("error"),
                    },
                    "condition_b": {
                        "skipped": "local_draft_empty",
                        "local_latency_s": local_result.get("latency_s"),
                        "local_raw": local_raw[:1000],
                    },
                })
                continue

            # Compressed draft format — bare JSON array
            local_draft_json = json.dumps(local_candidates, ensure_ascii=False)
            b_prompt = CONDITION_B_API_PROMPT_COMPRESSED.replace("{local_draft_json}", local_draft_json) + doc_text

            try:
                b = call_haiku(client, b_prompt)
                b_ents, b_edges = parse_graph(b["response_text"])
                print(f" B api: in={b['input_tokens']} out={b['output_tokens']} "
                      f"ents={b_ents} edges={b_edges} stop={b['stop_reason']} t={b['latency_s']}s",
                      flush=True)
            except Exception as e:
                print(f" B api FAILED: {e}", flush=True)
                b = {"error": str(e)}
                b_ents = b_edges = None

        if "input_tokens" in a and "input_tokens" in b:
            in_pct = (b["input_tokens"] - a["input_tokens"]) / a["input_tokens"] * 100 if a["input_tokens"] else 0.0
            out_pct = (b["output_tokens"] - a["output_tokens"]) / a["output_tokens"] * 100 if a["output_tokens"] else 0.0
            edge_pct_str = "n/a"
            if a_edges and b_edges is not None and a_edges > 0:
                edge_pct_str = f"{(b_edges - a_edges) / a_edges * 100:+.1f}%"
            print(f" Δ input={in_pct:+.1f}% output={out_pct:+.1f}% edges={edge_pct_str}", flush=True)

        results.append({
            "source": source,
            "size_bucket": size_bucket,
            "doc_chars_original": original_len,
            "doc_chars_sent": sent_len,
            "truncated": truncated,
            "skip_small_routed": skip_small_routed,
            "condition_a": {
                "input_tokens": a.get("input_tokens"),
                "output_tokens": a.get("output_tokens"),
                "latency_s": a.get("latency_s"),
                "entity_count": a_ents,
                "edge_count": a_edges,
                "stop_reason": a.get("stop_reason"),
                "response_text": a.get("response_text", "")[:4000],
                "error": a.get("error"),
            },
            "condition_b": {
                "skip_small_routed": skip_small_routed,
                "local_latency_s": local_result.get("latency_s"),
                "local_candidates": local_candidates,
                "local_raw": local_raw[:1000],
                "api_input_tokens": b.get("input_tokens"),
                "api_output_tokens": b.get("output_tokens"),
                "api_latency_s": b.get("latency_s"),
                "entity_count": b_ents,
                "edge_count": b_edges,
                "stop_reason": b.get("stop_reason"),
                "response_text": b.get("response_text", "")[:4000],
                "error": b.get("error"),
            },
        })

    pg_conn.close()
    total_elapsed = round(time.time() - t_total, 1)

    valid = [r for r in results
             if r.get("condition_a", {}).get("input_tokens") is not None
             and r.get("condition_b", {}).get("api_input_tokens") is not None]

    a_in = sum(r["condition_a"]["input_tokens"] for r in valid)
    a_out = sum(r["condition_a"]["output_tokens"] for r in valid)
    b_in = sum(r["condition_b"]["api_input_tokens"] for r in valid)
    b_out = sum(r["condition_b"]["api_output_tokens"] for r in valid)
    a_cost = (a_in * HAIKU_IN_PER_M + a_out * HAIKU_OUT_PER_M) / 1_000_000
    b_cost = (b_in * HAIKU_IN_PER_M + b_out * HAIKU_OUT_PER_M) / 1_000_000

    by_bucket = {}
    for bucket in ("small", "medium", "large"):
        rows = [r for r in valid if r["size_bucket"] == bucket]
        if not rows:
            by_bucket[bucket] = None
            continue
        ai = sum(r["condition_a"]["input_tokens"] for r in rows)
        ao = sum(r["condition_a"]["output_tokens"] for r in rows)
        bi = sum(r["condition_b"]["api_input_tokens"] for r in rows)
        bo = sum(r["condition_b"]["api_output_tokens"] for r in rows)
        ae = [r["condition_a"]["edge_count"] for r in rows if r["condition_a"]["edge_count"] is not None]
        be = [r["condition_b"]["edge_count"] for r in rows if r["condition_b"]["edge_count"] is not None]
        skip_count = sum(1 for r in rows if r.get("skip_small_routed"))
        by_bucket[bucket] = {
            "n": len(rows),
            "n_skip_small_routed": skip_count,
            "n_cascade": len(rows) - skip_count,
            "a_input_tokens": ai,
            "a_output_tokens": ao,
            "b_input_tokens": bi,
            "b_output_tokens": bo,
            "input_delta_pct": round((bi - ai) / ai * 100, 2) if ai else None,
            "output_delta_pct": round((bo - ao) / ao * 100, 2) if ao else None,
            "a_avg_edges": round(statistics.mean(ae), 1) if ae else None,
            "b_avg_edges": round(statistics.mean(be), 1) if be else None,
        }

    summary = {
        "experiment": "cascade_optimization_test",
        "title": "Cascade Optimization — skip-small + compressed-draft",
        "started_at": started_at,
        "completed_at": datetime.now(timezone.utc).isoformat(),
        "haiku_model": HAIKU_MODEL,
        "haiku_temperature": HAIKU_TEMPERATURE,
        "haiku_max_tokens": HAIKU_MAX_TOKENS,
        "local_model": LOCAL_MODEL,
        "max_doc_chars": MAX_DOC_CHARS,
        "skip_small_threshold": SKIP_SMALL_THRESHOLD,
        "n_documents": len(sample),
        "n_valid_pairs": len(valid),
        "n_skipped": len(sample) - len(valid),
        "total_elapsed_s": total_elapsed,
        "totals": {
            "a_input_tokens": a_in,
            "a_output_tokens": a_out,
            "b_input_tokens": b_in,
            "b_output_tokens": b_out,
            "a_cost_usd": round(a_cost, 4),
            "b_cost_usd": round(b_cost, 4),
            "cost_delta_usd": round(b_cost - a_cost, 4),
            "cost_delta_pct": round((b_cost - a_cost) / a_cost * 100, 2) if a_cost else None,
            "prior_unoptimized_cascade_pct": -12.66,
            "note": "API cost only — local Mistral runtime on VPS not monetized",
        },
        "by_size_bucket": by_bucket,
        "results": results,
    }

    OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    with open(OUTPUT_FILE, "w") as f:
        json.dump(summary, f, indent=2)

    print()
    print("=" * 60)
    print(f"DONE — {len(valid)}/{len(sample)} valid pairs in {total_elapsed}s")
    print(f"A total cost: ${a_cost:.4f} (in={a_in} out={a_out})")
    print(f"B total cost: ${b_cost:.4f} (in={b_in} out={b_out})")
    delta_pct = summary['totals']['cost_delta_pct']
    if delta_pct is not None:
        verdict = "B cheaper" if delta_pct < 0 else "B more expensive"
        print(f"Cost delta: {delta_pct:+.2f}% ({verdict})")
        # Read the prior figure from the summary so the value is stated in one place only.
        prior_pct = summary['totals']['prior_unoptimized_cascade_pct']
        opt_delta = delta_pct - prior_pct
        print(f"Optimization delta vs prior cascade: {opt_delta:+.2f} points "
              f"(prior was {prior_pct:+.2f}%)")
    print()
    print("By size bucket:")
    for bucket, stats in by_bucket.items():
        if stats:
            print(f" {bucket:6s} (n={stats['n']}, skip={stats['n_skip_small_routed']}): "
                  f"in {stats['input_delta_pct']:+.1f}% "
                  f"out {stats['output_delta_pct']:+.1f}% "
                  f"edges A={stats['a_avg_edges']} B={stats['b_avg_edges']}")
    print()
    print("Results: " + str(OUTPUT_FILE))


if __name__ == "__main__":
    main()
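The cost totals printed at the end of the script above come down to a few lines of arithmetic. The following is a minimal sketch of that calculation, assuming the per-million-token rates (1.0 USD input, 5.0 USD output) that the companion cascade test script defines; the helper names `api_cost_usd` and `cost_delta_pct` and the token counts are hypothetical illustration values, not measured results.

```python
# Rates in USD per million tokens, assumed to match the companion script's constants.
HAIKU_IN_PER_M = 1.0
HAIKU_OUT_PER_M = 5.0


def api_cost_usd(input_tokens, output_tokens):
    """Price a token total at the per-million rates above."""
    return (input_tokens * HAIKU_IN_PER_M + output_tokens * HAIKU_OUT_PER_M) / 1_000_000


def cost_delta_pct(a_cost, b_cost):
    """Relative cost of B versus A in percent; None when A has no cost to compare against."""
    if not a_cost:
        return None
    return round((b_cost - a_cost) / a_cost * 100, 2)


# Hypothetical totals: B spends more input tokens (the draft) but fewer output tokens.
a_cost = api_cost_usd(100_000, 20_000)
b_cost = api_cost_usd(110_000, 14_000)
delta = cost_delta_pct(a_cost, b_cost)
print(f"A=${a_cost:.4f} B=${b_cost:.4f} delta={delta:+.2f}%")
```

Because output tokens are priced at five times the input rate here, a cascade can come out cheaper even when its prompt is longer, as long as it saves enough output tokens.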
@@ -0,0 +1,485 @@
#!/usr/bin/env python3
"""
Cascade Test — Nodes-vs-Edges Experiment

Tests whether splitting graph extraction into "local drafts entity candidates,
API verifies + draws edges" reduces total API cost vs single-shot full
extraction, while producing a comparable graph.

Two conditions per document:
  A — Baseline: single Claude Haiku call, full extraction
  B — Cascade: Mistral lists entity candidates, then Haiku does verify+edges

Both conditions:
  - See the full document (parity-respecting)
  - Use open entity type vocabulary (no fixed schema)
  - Use natural-language predicates (no constrained relations)
  - Same target output schema, same temperature

Sample: 20 docs from briefing_test_v2_results.json, stratified by char length.
Reports API cost only. Local Mistral time is recorded but not monetized
(ran on the VPS, no per-token API charge).

Outputs: ~/aaronai/experiments/cascade_test_results.json
"""

import json
import os
import re
import statistics
import sys
import time
from datetime import datetime, timezone
from pathlib import Path

import anthropic
import psycopg2
import requests
from dotenv import load_dotenv

load_dotenv(Path.home() / "aaronai" / ".env")

V2_FILE = Path.home() / "aaronai" / "briefing_test_v2_results.json"
OUTPUT_FILE = Path.home() / "aaronai" / "experiments" / "cascade_test_results.json"
HAIKU_MODEL = "claude-haiku-4-5-20251001"
HAIKU_MAX_TOKENS = 4096
HAIKU_TEMPERATURE = 0.0
OLLAMA_URL = "http://localhost:11434/api/generate"
LOCAL_MODEL = "mistral"
LOCAL_TIMEOUT = 120
MAX_DOC_CHARS = 8000

# Verified pricing 2026-04-28 against Anthropic docs
HAIKU_IN_PER_M = 1.0
HAIKU_OUT_PER_M = 5.0


CONDITION_A_PROMPT = """Extract a knowledge graph from the document below.

Return ONLY valid JSON with this exact schema:
{
  "entities": [
    {"name": string, "type": string}
  ],
  "edges": [
    {"subject": string, "predicate": string, "object": string}
  ]
}

Entity types: use whatever fits the entity. Do not constrain yourself to a fixed list.

Edge predicates: natural language phrases that capture the actual relationship the document states or implies.

Extract every entity and every relationship the document states or strongly implies. Both subject and object in every edge must appear in entities. JSON only, no commentary, no markdown fences.

DOCUMENT:
"""

LOCAL_PROMPT = """List every named entity that appears in the document below — every person, organization, place, project, document, material, technique, date, event, or other named thing.

Return ONLY valid JSON:
{
  "candidates": [string]
}

Just names. No types, no relationships. JSON only.

DOCUMENT:
"""

CONDITION_B_API_PROMPT_WITH_DRAFT = """Extract a knowledge graph from the document below.

A local model has identified entity candidates that may help orient your reading. Treat the candidates as a hint, not as truth — verify each candidate appears in the document, ignore any that do not, and add any entities the candidates missed.

Return ONLY valid JSON with this exact schema:
{
  "entities": [
    {"name": string, "type": string}
  ],
  "edges": [
    {"subject": string, "predicate": string, "object": string}
  ]
}

Entity types: use whatever fits. Edge predicates: natural language phrases capturing the actual relationship. Both subject and object in every edge must appear in entities. Extract every entity and every relationship the document states or strongly implies. JSON only, no commentary, no markdown fences.

ENTITY CANDIDATES FROM LOCAL MODEL:
{local_draft}

DOCUMENT:
"""


def strip_json_fences(text):
    if not text:
        return ""
    t = text.strip()
    t = re.sub(r"^```(?:json)?\s*", "", t)
    t = re.sub(r"\s*```$", "", t)
    return t.strip()


def fetch_document_text(pg_conn, source):
    cur = pg_conn.cursor()
    cur.execute(
        "SELECT document FROM embeddings WHERE source = %s ORDER BY id",
        (source,),
    )
    rows = cur.fetchall()
    cur.close()
    if not rows:
        return None, 0
    full = "\n\n".join(r[0] for r in rows)
    return full[:MAX_DOC_CHARS], len(full)


def call_haiku(client, prompt_text):
    t0 = time.time()
    resp = client.messages.create(
        model=HAIKU_MODEL,
        max_tokens=HAIKU_MAX_TOKENS,
        temperature=HAIKU_TEMPERATURE,
        messages=[{"role": "user", "content": prompt_text}],
    )
    return {
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
        "latency_s": round(time.time() - t0, 2),
        "response_text": resp.content[0].text if resp.content else "",
        "stop_reason": resp.stop_reason,
    }


def call_local(document_text):
    t0 = time.time()
    try:
        resp = requests.post(
            OLLAMA_URL,
            json={
                "model": LOCAL_MODEL,
                "prompt": LOCAL_PROMPT + document_text,
                "stream": False,
                "format": "json",
                "options": {"num_predict": 1024, "temperature": 0, "num_ctx": 8192},
            },
            timeout=LOCAL_TIMEOUT,
        )
        resp.raise_for_status()
        return {
            "response": resp.json().get("response", ""),
            "latency_s": round(time.time() - t0, 2),
        }
    except Exception as e:
        return {"error": str(e), "latency_s": round(time.time() - t0, 2)}


def parse_graph(raw):
    cleaned = strip_json_fences(raw)
    if not cleaned:
        return None, None
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None, None
    if not isinstance(data, dict):
        return None, None
    ents = data.get("entities")
    edges = data.get("edges")
    if isinstance(ents, list) and isinstance(edges, list):
        return len(ents), len(edges)
    return None, None


def parse_candidates(raw):
    cleaned = strip_json_fences(raw)
    if not cleaned:
        return None
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    cands = data.get("candidates")
    if isinstance(cands, list):
        return [str(c).strip() for c in cands if c]
    return None


def stratify(docs):
    """Pick 5 small / 10 medium / 5 large by character length, in file order."""
    sized = [(d, d["content_length"]) for d in docs]
    small = [d for d, n in sized if n < 1000]
    medium = [d for d, n in sized if 1000 <= n < 5000]
    large = [d for d, n in sized if n >= 5000]
    return small[:5] + medium[:10] + large[:5]


def main():
    api_key = os.environ.get("ANTHROPIC_API_KEY")
    pg_dsn = os.environ.get("PG_DSN")
    if not api_key or not pg_dsn:
        print("ERROR: ANTHROPIC_API_KEY or PG_DSN not set", file=sys.stderr)
        sys.exit(1)

    if not V2_FILE.exists():
        print(f"ERROR: {V2_FILE} not found", file=sys.stderr)
        sys.exit(1)

    with open(V2_FILE) as f:
        v2 = json.load(f)

    docs_meta = [d for d in v2["documents"] if d.get("status") == "SUCCESS"]
    sample = stratify(docs_meta)
    print(f"Sample: {len(sample)} docs (stratified by char length, file order)")
    for d in sample:
        print(f" [{d['content_length']:>6}c] {d['source'][:60]}")
    print(f"Haiku model: {HAIKU_MODEL} temp={HAIKU_TEMPERATURE} max_tokens={HAIKU_MAX_TOKENS}")
    print(f"Local model: {LOCAL_MODEL}")
    print()

    client = anthropic.Anthropic(api_key=api_key)
    pg_conn = psycopg2.connect(pg_dsn)

    results = []
    started_at = datetime.now(timezone.utc).isoformat()
    t_total = time.time()

    for i, doc_meta in enumerate(sample, 1):
        source = doc_meta["source"]
        doc_text, original_len = fetch_document_text(pg_conn, source)
        if not doc_text:
            print(f"[{i:02d}/{len(sample)}] {source[:60]} — SKIP (not in pgvector)")
            results.append({"source": source, "skipped": "not_in_pgvector"})
            continue

        sent_len = len(doc_text)
        truncated = original_len > sent_len
        size_bucket = (
            "small" if sent_len < 1000
            else "medium" if sent_len < 5000
            else "large"
        )
        trunc_marker = "*" if truncated else " "
        print(f"[{i:02d}/{len(sample)}] [{size_bucket:6s}] [{sent_len:>5}c{trunc_marker}] {source[:55]}", flush=True)

        # Condition A
        try:
            a = call_haiku(client, CONDITION_A_PROMPT + doc_text)
            a_ents, a_edges = parse_graph(a["response_text"])
            print(f" A: in={a['input_tokens']} out={a['output_tokens']} "
                  f"ents={a_ents} edges={a_edges} stop={a['stop_reason']} t={a['latency_s']}s",
                  flush=True)
        except Exception as e:
            print(f" A FAILED: {e}", flush=True)
            a = {"error": str(e)}
            a_ents = a_edges = None

        # Condition B local pass
        local_result = call_local(doc_text)
        if "error" in local_result:
            print(f" B local FAILED: {local_result['error']} — skipping doc", flush=True)
            results.append({
                "source": source,
                "size_bucket": size_bucket,
                "doc_chars_original": original_len,
                "doc_chars_sent": sent_len,
                "truncated": truncated,
                "condition_a": {
                    "input_tokens": a.get("input_tokens"),
                    "output_tokens": a.get("output_tokens"),
                    "latency_s": a.get("latency_s"),
                    "entity_count": a_ents,
                    "edge_count": a_edges,
                    "stop_reason": a.get("stop_reason"),
                    "response_text": a.get("response_text", "")[:4000],
                    "error": a.get("error"),
                },
                "condition_b": {
                    "skipped": "local_model_failed",
                    "local_error": local_result["error"],
                    "local_latency_s": local_result.get("latency_s"),
                },
            })
            continue

        local_raw = local_result["response"]
        cands = parse_candidates(local_raw)
        local_candidates = cands or []
        print(f" B local: t={local_result['latency_s']}s candidates={len(local_candidates)}",
              flush=True)

        if not local_candidates:
            print(" B local: empty draft — skipping API call to avoid asymmetric test", flush=True)
            results.append({
                "source": source,
                "size_bucket": size_bucket,
                "doc_chars_original": original_len,
                "doc_chars_sent": sent_len,
                "truncated": truncated,
                "condition_a": {
                    "input_tokens": a.get("input_tokens"),
                    "output_tokens": a.get("output_tokens"),
                    "latency_s": a.get("latency_s"),
                    "entity_count": a_ents,
                    "edge_count": a_edges,
                    "stop_reason": a.get("stop_reason"),
                    "response_text": a.get("response_text", "")[:4000],
                    "error": a.get("error"),
                },
                "condition_b": {
                    "skipped": "local_draft_empty",
                    "local_latency_s": local_result.get("latency_s"),
                    "local_raw": local_raw[:1000],
                },
            })
            continue

        local_draft_str = "\n".join(f"- {c}" for c in local_candidates)
        b_prompt = CONDITION_B_API_PROMPT_WITH_DRAFT.replace("{local_draft}", local_draft_str) + doc_text

        try:
            b = call_haiku(client, b_prompt)
            b_ents, b_edges = parse_graph(b["response_text"])
            print(f" B api: in={b['input_tokens']} out={b['output_tokens']} "
                  f"ents={b_ents} edges={b_edges} stop={b['stop_reason']} t={b['latency_s']}s",
                  flush=True)
        except Exception as e:
            print(f" B api FAILED: {e}", flush=True)
            b = {"error": str(e)}
            b_ents = b_edges = None

        if "input_tokens" in a and "input_tokens" in b:
            in_pct = (b["input_tokens"] - a["input_tokens"]) / a["input_tokens"] * 100 if a["input_tokens"] else 0.0
            out_pct = (b["output_tokens"] - a["output_tokens"]) / a["output_tokens"] * 100 if a["output_tokens"] else 0.0
            edge_pct_str = "n/a"
            # A truthy a_edges is already a positive count, so no separate > 0 check is needed.
            if a_edges and b_edges is not None:
                edge_pct_str = f"{(b_edges - a_edges) / a_edges * 100:+.1f}%"
            print(f" Δ input={in_pct:+.1f}% output={out_pct:+.1f}% edges={edge_pct_str}", flush=True)

        results.append({
            "source": source,
            "size_bucket": size_bucket,
            "doc_chars_original": original_len,
            "doc_chars_sent": sent_len,
            "truncated": truncated,
            "condition_a": {
                "input_tokens": a.get("input_tokens"),
                "output_tokens": a.get("output_tokens"),
                "latency_s": a.get("latency_s"),
                "entity_count": a_ents,
                "edge_count": a_edges,
                "stop_reason": a.get("stop_reason"),
                "response_text": a.get("response_text", "")[:4000],
                "error": a.get("error"),
            },
            "condition_b": {
                "local_latency_s": local_result.get("latency_s"),
                "local_candidates": local_candidates,
                "local_raw": local_raw[:1000],
                "api_input_tokens": b.get("input_tokens"),
                "api_output_tokens": b.get("output_tokens"),
                "api_latency_s": b.get("latency_s"),
                "entity_count": b_ents,
                "edge_count": b_edges,
                "stop_reason": b.get("stop_reason"),
                "response_text": b.get("response_text", "")[:4000],
                "error": b.get("error"),
            },
        })

    pg_conn.close()
    total_elapsed = round(time.time() - t_total, 1)

    valid = [r for r in results
             if r.get("condition_a", {}).get("input_tokens") is not None
             and r.get("condition_b", {}).get("api_input_tokens") is not None]

    a_in = sum(r["condition_a"]["input_tokens"] for r in valid)
    a_out = sum(r["condition_a"]["output_tokens"] for r in valid)
    b_in = sum(r["condition_b"]["api_input_tokens"] for r in valid)
    b_out = sum(r["condition_b"]["api_output_tokens"] for r in valid)
    a_cost = (a_in * HAIKU_IN_PER_M + a_out * HAIKU_OUT_PER_M) / 1_000_000
    b_cost = (b_in * HAIKU_IN_PER_M + b_out * HAIKU_OUT_PER_M) / 1_000_000

    by_bucket = {}
    for bucket in ("small", "medium", "large"):
        rows = [r for r in valid if r["size_bucket"] == bucket]
        if not rows:
            by_bucket[bucket] = None
            continue
        ai = sum(r["condition_a"]["input_tokens"] for r in rows)
        ao = sum(r["condition_a"]["output_tokens"] for r in rows)
        bi = sum(r["condition_b"]["api_input_tokens"] for r in rows)
        bo = sum(r["condition_b"]["api_output_tokens"] for r in rows)
        ae = [r["condition_a"]["edge_count"] for r in rows if r["condition_a"]["edge_count"] is not None]
        be = [r["condition_b"]["edge_count"] for r in rows if r["condition_b"]["edge_count"] is not None]
        by_bucket[bucket] = {
            "n": len(rows),
            "a_input_tokens": ai,
            "a_output_tokens": ao,
            "b_input_tokens": bi,
            "b_output_tokens": bo,
            "input_delta_pct": round((bi - ai) / ai * 100, 2) if ai else None,
            "output_delta_pct": round((bo - ao) / ao * 100, 2) if ao else None,
            "a_avg_edges": round(statistics.mean(ae), 1) if ae else None,
            "b_avg_edges": round(statistics.mean(be), 1) if be else None,
        }

    summary = {
        "experiment": "cascade_test",
        "title": "Nodes-vs-Edges Cascade Experiment",
        "started_at": started_at,
        "completed_at": datetime.now(timezone.utc).isoformat(),
        "haiku_model": HAIKU_MODEL,
        "haiku_temperature": HAIKU_TEMPERATURE,
        "haiku_max_tokens": HAIKU_MAX_TOKENS,
        "local_model": LOCAL_MODEL,
        "max_doc_chars": MAX_DOC_CHARS,
        "n_documents": len(sample),
        "n_valid_pairs": len(valid),
        "n_skipped": len(sample) - len(valid),
        "total_elapsed_s": total_elapsed,
        "totals": {
            "a_input_tokens": a_in,
            "a_output_tokens": a_out,
            "b_input_tokens": b_in,
            "b_output_tokens": b_out,
            "a_cost_usd": round(a_cost, 4),
            "b_cost_usd": round(b_cost, 4),
            "cost_delta_usd": round(b_cost - a_cost, 4),
            "cost_delta_pct": round((b_cost - a_cost) / a_cost * 100, 2) if a_cost else None,
            "note": "API cost only — local Mistral runtime on VPS not monetized",
        },
        "by_size_bucket": by_bucket,
        "results": results,
    }

    OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    with open(OUTPUT_FILE, "w") as f:
        json.dump(summary, f, indent=2)

    print()
    print("=" * 60)
    print(f"DONE — {len(valid)}/{len(sample)} valid pairs in {total_elapsed}s")
    print(f"A total cost: ${a_cost:.4f} (in={a_in} out={a_out})")
    print(f"B total cost: ${b_cost:.4f} (in={b_in} out={b_out})")
    delta_pct = summary['totals']['cost_delta_pct']
    if delta_pct is not None:
        verdict = "B cheaper" if delta_pct < 0 else "B more expensive"
        print(f"Cost delta: {delta_pct:+.2f}% ({verdict})")
    print()
    print("By size bucket:")
    for bucket, stats in by_bucket.items():
        if stats:
            print(f" {bucket:6s} (n={stats['n']}): "
                  f"in {stats['input_delta_pct']:+.1f}% "
                  f"out {stats['output_delta_pct']:+.1f}% "
                  f"edges A={stats['a_avg_edges']} B={stats['b_avg_edges']}")
    print()
    print("NOTE: API cost only. Local Mistral runtime is not monetized.")
    print(f"Results: {OUTPUT_FILE}")


if __name__ == "__main__":
    main()
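The script above fills the `{local_draft}` slot with `str.replace` rather than `str.format`. One plausible reason, offered here as an assumption and not something the source states, is that the prompt also contains the literal braces of the JSON schema, which `str.format` would try to interpret as fields. The sketch below illustrates the difference with a shortened, hypothetical template; `TEMPLATE`, `fill_with_replace`, and `fill_with_format` are illustration names only.

```python
# Hypothetical, shortened template: the first pair of braces is literal JSON, the second is the slot.
TEMPLATE = (
    'Return ONLY valid JSON: {"candidates": [string]}\n\n'
    "ENTITY CANDIDATES FROM LOCAL MODEL:\n{local_draft}\n\n"
    "DOCUMENT:\n"
)


def fill_with_replace(template, draft):
    """Substitute only the named slot; every other brace is left untouched."""
    return template.replace("{local_draft}", draft)


def fill_with_format(template, draft):
    """Try str.format; return None if the literal JSON braces are rejected as a bad field."""
    try:
        return template.format(local_draft=draft)
    except (KeyError, ValueError, IndexError):
        return None


draft = "\n".join(f"- {c}" for c in ["Alice", "Example Corp"])
filled = fill_with_replace(TEMPLATE, draft)
format_result = fill_with_format(TEMPLATE, draft)

print(filled)
print("str.format succeeded:", format_result is not None)
```

Doubling the schema braces (`{{` and `}}`) would let `str.format` work as well, at the cost of a template that is harder to read next to the expected output.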
@@ -0,0 +1,230 @@
#!/usr/bin/env python3
"""
Experiment 003 — Entity-Only Consistency Test

Three Mistral passes per document, measure consistency on entity fields only
(people, organizations, locations, dates). Excludes document_type label.
Samples one aggregated document per source (GROUP BY source), which fixes the
Exp 001 chunk-replacement flaw.

Outputs: ~/aaronai/experiments/consistency_test_v2_results.json
"""

import json
import os
import re
import sys
import time
from datetime import datetime, timezone
from pathlib import Path

import psycopg2
import requests
from dotenv import load_dotenv

load_dotenv(Path.home() / "aaronai" / ".env")

OUTPUT_FILE = Path.home() / "aaronai" / "experiments" / "consistency_test_v2_results.json"
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "mistral"
N_PASSES = 3
N_DOCS = 50
PER_CALL_TIMEOUT = 60  # seconds — fail fast, don't wedge
MAX_DOC_CHARS = 8000  # cap document length sent to Mistral

EXTRACTION_PROMPT = """Extract entities from the document below. Return ONLY valid JSON with this exact schema:
{
  "people": [string],
  "organizations": [string],
  "locations": [string],
  "dates": [string]
}
Rules:
- Only include entities you are CERTAIN about. If uncertain, omit.
- No prose, no markdown fences, no commentary. JSON only.
- Empty arrays are valid.

DOCUMENT:
"""


def call_mistral(document_text):
    truncated = document_text[:MAX_DOC_CHARS]
    t0 = time.time()
    try:
        resp = requests.post(
            OLLAMA_URL,
            json={
                "model": MODEL,
                "prompt": EXTRACTION_PROMPT + truncated,
                "stream": False,
                "format": "json",
                "options": {"num_predict": 512},
            },
            timeout=PER_CALL_TIMEOUT,
        )
        resp.raise_for_status()
        return {
            "response": resp.json().get("response", ""),
            "latency_s": round(time.time() - t0, 2),
            "truncated": len(document_text) > MAX_DOC_CHARS,
        }
    except requests.exceptions.Timeout:
        return {"error": f"timeout after {PER_CALL_TIMEOUT}s", "latency_s": PER_CALL_TIMEOUT}
    except Exception as e:
        return {"error": str(e), "latency_s": round(time.time() - t0, 2)}


def parse_entities(raw_response):
    text = (raw_response or "").strip()
    text = re.sub(r"^```(?:json)?\s*", "", text)
    text = re.sub(r"\s*```$", "", text)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    # Valid JSON that is not an object (e.g. a bare list) has no .get(); treat it as unparsed.
    if not isinstance(data, dict):
        return None
    out = {}
    for key in ("people", "organizations", "locations", "dates"):
        vals = data.get(key, [])
        if not isinstance(vals, list):
            return None
        # Normalize so casing, whitespace, order, and duplicates do not count as disagreement.
        out[key] = sorted(set(str(v).strip().lower() for v in vals if v))
    return out


def entities_match(a, b):
    if a is None or b is None:
        return False
    return all(a[k] == b[k] for k in ("people", "organizations", "locations", "dates"))


def fetch_distinct_sources(pg_conn, n):
    cur = pg_conn.cursor()
    cur.execute("""
        SELECT source, string_agg(document, E'\n\n' ORDER BY id) AS doc
        FROM embeddings
        WHERE source IS NOT NULL
        GROUP BY source
        ORDER BY MIN(id)
        LIMIT %s
    """, (n,))
    rows = cur.fetchall()
    cur.close()
    return [(s, d) for s, d in rows if d and len(d.strip()) > 50]


def main():
    pg_dsn = os.environ.get("PG_DSN")
    if not pg_dsn:
        print("ERROR: PG_DSN not set", file=sys.stderr)
        sys.exit(1)

    pg_conn = psycopg2.connect(pg_dsn)
    docs = fetch_distinct_sources(pg_conn, N_DOCS)
    pg_conn.close()

    print(f"Loaded {len(docs)} distinct sources from pgvector")
    print(f"Model: {MODEL} | Passes per doc: {N_PASSES}")
    print(f"Per-call timeout: {PER_CALL_TIMEOUT}s | Max doc chars: {MAX_DOC_CHARS}")
    print(f"Calls planned: {len(docs) * N_PASSES}\n")

    results = []
    started_at = datetime.now(timezone.utc).isoformat()
    t_total = time.time()

    for i, (source, doc_text) in enumerate(docs, 1):
        size_marker = f"[{len(doc_text):>5}c]"
        print(f"[{i:02d}/{len(docs)}] {size_marker} {source[:55]}", flush=True)
        passes = []
        for p in range(N_PASSES):
            r = call_mistral(doc_text)
            if "error" in r:
                print(f" pass {p+1}: {r['error']}", flush=True)
                passes.append({"error": r["error"], "parsed_ok": False, "latency_s": r["latency_s"]})
            else:
                entities = parse_entities(r["response"])
                passes.append({
                    "raw": r["response"][:500],
                    "entities": entities,
                    "latency_s": r["latency_s"],
                    "parsed_ok": entities is not None,
                    "truncated_input": r.get("truncated", False),
                })

        all_parsed = all(p.get("parsed_ok") for p in passes)
        if all_parsed:
            # Compare every pass against the first so the check follows N_PASSES
            # instead of assuming exactly three passes.
            ents = [p["entities"] for p in passes]
            consistent = all(entities_match(ents[0], e) for e in ents[1:])
            per_field = {
                k: all(e[k] == ents[0][k] for e in ents[1:])
                for k in ("people", "organizations", "locations", "dates")
            }
        else:
            consistent = False
            per_field = None

        latencies = [p.get("latency_s", 0) for p in passes]
        print(f" parsed={all_parsed} consistent={consistent} latencies={latencies}", flush=True)

        results.append({
            "source": source,
            "doc_chars": len(doc_text),
            "passes": passes,
            "all_parsed": all_parsed,
            "consistent": consistent,
            "per_field_consistency": per_field,
        })

    total_elapsed = round(time.time() - t_total, 1)

    parsed = [r for r in results if r["all_parsed"]]
    consistent = [r for r in parsed if r["consistent"]]

    field_rates = {k: 0 for k in ("people", "organizations", "locations", "dates")}
    for r in parsed:
        for k, v in (r["per_field_consistency"] or {}).items():
            if v:
                field_rates[k] += 1
    field_rates_pct = {
        k: round(100 * v / len(parsed), 1) if parsed else 0.0
        for k, v in field_rates.items()
    }

    summary = {
        "experiment": "003",
        "title": "Entity-Only Consistency Test",
        "started_at": started_at,
        "completed_at": datetime.now(timezone.utc).isoformat(),
        "model": MODEL,
        "n_passes": N_PASSES,
        "per_call_timeout_s": PER_CALL_TIMEOUT,
        "max_doc_chars": MAX_DOC_CHARS,
        "n_documents": len(docs),
        "n_all_parsed": len(parsed),
        "n_fully_consistent": len(consistent),
        "consistency_rate_pct": round(100 * len(consistent) / len(docs), 2) if docs else 0.0,
        "consistency_rate_among_parsed_pct": (
            round(100 * len(consistent) / len(parsed), 2) if parsed else 0.0
        ),
        "per_field_consistency_pct": field_rates_pct,
        "total_elapsed_s": total_elapsed,
        "exp_001_baseline_pct": 18.0,
        "results": results,
    }

    OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    with open(OUTPUT_FILE, "w") as f:
        json.dump(summary, f, indent=2)

    print()
    print("=" * 60)
    print(f"DONE — {len(docs)} docs in {total_elapsed}s")
    print(f"All {N_PASSES} passes parsed cleanly: {len(parsed)}/{len(docs)}")
    print(f"Fully consistent (all 4 fields match): {len(consistent)}/{len(docs)} ({summary['consistency_rate_pct']}%)")
    print(f"Among parsed only: {summary['consistency_rate_among_parsed_pct']}%")
    print(f"Per-field consistency: {field_rates_pct}")
    print(f"Exp 001 baseline: 18% | delta: {summary['consistency_rate_pct'] - 18.0:+.2f} pts")
    print(f"Results: {OUTPUT_FILE}")


if __name__ == "__main__":
    main()
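What the consistency test above counts as "consistent" depends on the normalization applied before comparison: values are stripped, lowercased, deduplicated, and sorted, so only a genuinely different set of entities registers as disagreement. The sketch below reproduces that idea on invented toy data; `normalize` and `all_passes_agree` are hypothetical helper names, and the sample entities are made up for illustration.

```python
FIELDS = ("people", "organizations", "locations", "dates")


def normalize(entities):
    """Strip, lowercase, dedupe, and sort each field so casing and order are ignored."""
    return {
        key: sorted({str(v).strip().lower() for v in entities.get(key, []) if v})
        for key in FIELDS
    }


def all_passes_agree(passes):
    """True when every pass normalizes to the same entity sets as the first pass."""
    normalized = [normalize(p) for p in passes]
    return all(n == normalized[0] for n in normalized[1:])


# Invented pass outputs: pass_2 differs from pass_1 only in casing, order, and whitespace.
pass_1 = {"people": ["Alice Example", "Bob Sample"], "organizations": ["Example Corp"],
          "locations": ["Springfield"], "dates": ["2024-01-15"]}
pass_2 = {"people": ["bob sample", "Alice Example "], "organizations": ["EXAMPLE CORP"],
          "locations": ["Springfield"], "dates": ["2024-01-15"]}
# pass_3 drops one person, which is a real disagreement.
pass_3 = {"people": ["Alice Example"], "organizations": ["Example Corp"],
          "locations": ["Springfield"], "dates": ["2024-01-15"]}

print(all_passes_agree([pass_1, pass_2]))
print(all_passes_agree([pass_1, pass_2, pass_3]))
```

Exact-set matching is a strict criterion: a single extra or missing entity in any field marks the whole document as inconsistent, which is worth keeping in mind when reading the headline rate.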
@@ -0,0 +1,179 @@
|
||||
"""
|
||||
Measure actual Graphiti BULK episode cost on a stratified sample.
|
||||
Uses /episodes/bulk endpoint. Submits in small batches to avoid rate limits.
|
||||
"""
|
||||
import json, os, random, time
|
||||
from pathlib import Path
|
||||
import psycopg2, requests
|
||||
from dotenv import load_dotenv
|
||||
|
||||
load_dotenv(Path.home() / "aaronai" / ".env")
|
||||
|
||||
GRAPHITI_URL = "http://localhost:8001"
|
||||
PG_DSN = os.environ["PG_DSN"]
|
||||
SAMPLE_SIZE = 50
|
||||
BATCH_SIZE = 5
|
||||
RANDOM_SEED = 42
|
||||
|
||||
OUT = Path.home() / "aaronai" / "experiments" / "graphiti_bulk_cost_test.json"
|
||||
OUT.parent.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
|
||||
def fetch_stratified_sample():
    conn = psycopg2.connect(PG_DSN)
    cur = conn.cursor()
    cur.execute("""
        SELECT source, STRING_AGG(document, E'\\n\\n' ORDER BY id) AS full_doc
        FROM embeddings
        GROUP BY source
    """)
    sources = [(s, doc) for s, doc in cur.fetchall() if doc]
    cur.close(); conn.close()

    random.seed(RANDOM_SEED)
    short = [(s, d) for s, d in sources if len(d) < 1000]
    medium = [(s, d) for s, d in sources if 1000 <= len(d) < 5000]
    long_ = [(s, d) for s, d in sources if len(d) >= 5000]

    print(f"Pool: short={len(short)} medium={len(medium)} long={len(long_)}")
    sample = (
        random.sample(short, min(15, len(short))) +
        random.sample(medium, min(25, len(medium))) +
        random.sample(long_, min(10, len(long_)))
    )
    print(f"Sample: {len(sample)} sources, batch_size={BATCH_SIZE}")
    return sample


def submit_bulk_batch(batch):
    payload = {
        "episodes": [
            {
                "name": source,
                "content": doc[:12000],
                "source_description": "pgvector_migration_bulk_test",
                "timestamp": "2026-04-28T00:00:00",
            }
            for source, doc in batch
        ]
    }
    t0 = time.time()
    try:
        r = requests.post(f"{GRAPHITI_URL}/episodes/bulk", json=payload, timeout=900)
        elapsed = time.time() - t0
        return {
            "batch_size": len(batch),
            "status_code": r.status_code,
            "elapsed_s": round(elapsed, 2),
            "elapsed_per_episode_s": round(elapsed / len(batch), 2),
            "response": r.json() if r.ok else None,
            "error": None if r.ok else r.text[:500],
            "sources": [s for s, _ in batch],
        }
    except Exception as e:
        return {
            "batch_size": len(batch),
            "status_code": None,
            "elapsed_s": round(time.time() - t0, 2),
            "elapsed_per_episode_s": None,
            "response": None,
            "error": str(e)[:500],
            "sources": [s for s, _ in batch],
        }


def main():
    print("=" * 60)
    print("Graphiti BULK Migration Cost Test (Haiku 4.5)")
    print("=" * 60)
    print()
    print("BEFORE running:")
    print(" 1. Open https://console.anthropic.com/settings/usage")
    print(" 2. Note current spend.")
    print()
    input("Press Enter when noted... ")
    print()

    sample = fetch_stratified_sample()
    if not sample:
        print("ERROR: empty sample"); return

    batches = [sample[i:i+BATCH_SIZE] for i in range(0, len(sample), BATCH_SIZE)]
    print(f"Submitting {len(batches)} batches of up to {BATCH_SIZE} episodes")
    print()

    results = []
    total_start = time.time()
    for i, batch in enumerate(batches, start=1):
        avg_chars = int(sum(len(d) for _, d in batch) / len(batch))
        print(f"[batch {i:2d}/{len(batches)}] n={len(batch)} avg_chars={avg_chars:6d}",
              end=" ", flush=True)
        result = submit_bulk_batch(batch)
        results.append(result)
        if result["error"]:
            err = result["error"]
            print(f" ERROR: {err[:80]}")
            # Check the HTTP status as well as the body text: a 429 response
            # does not necessarily repeat "429" in its error message.
            if result["status_code"] == 429 or "429" in err or "rate" in err.lower():
                print(" Rate limited - pausing 30s before next batch")
                time.sleep(30)
        else:
            print(f" {result['status_code']} {result['elapsed_s']}s "
                  f"({result['elapsed_per_episode_s']}s/episode)")
    total_elapsed = time.time() - total_start

    successful_batches = [r for r in results if r["error"] is None]
    failed_batches = [r for r in results if r["error"] is not None]
    successful_episodes = sum(r["batch_size"] for r in successful_batches)
    failed_episodes = sum(r["batch_size"] for r in failed_batches)

    summary = {
        "sample_size": len(sample),
        "batch_size": BATCH_SIZE,
        "n_batches": len(batches),
        "successful_batches": len(successful_batches),
        "failed_batches": len(failed_batches),
        "successful_episodes": successful_episodes,
        "failed_episodes": failed_episodes,
        "total_elapsed_s": round(total_elapsed, 1),
        "mean_elapsed_per_episode_s": round(
            sum(r["elapsed_s"] for r in successful_batches) /
            max(successful_episodes, 1), 2
        ) if successful_episodes else None,
        "results": results,
    }

    conn = psycopg2.connect(PG_DSN)
    cur = conn.cursor()
    cur.execute("SELECT COUNT(DISTINCT source) FROM embeddings")
    total_sources = cur.fetchone()[0]
    cur.close(); conn.close()

    summary["total_corpus_sources"] = total_sources
    if summary["mean_elapsed_per_episode_s"]:
        summary["estimated_migration_hours"] = round(
            total_sources * summary["mean_elapsed_per_episode_s"] / 3600, 1
        )

    OUT.write_text(json.dumps(summary, indent=2))

    print()
    print("=" * 60)
    print("RESULTS")
    print("=" * 60)
    print(f"Episodes: {summary['successful_episodes']}/{summary['sample_size']} succeeded")
    print(f"Batches: {summary['successful_batches']}/{summary['n_batches']} succeeded")
    print(f"Total elapsed: {summary['total_elapsed_s']}s")
    if summary["mean_elapsed_per_episode_s"]:
        print(f"Mean per episode: {summary['mean_elapsed_per_episode_s']}s")
        print(f"Total corpus sources: {summary['total_corpus_sources']}")
        print(f"Estimated migration runtime: {summary['estimated_migration_hours']} hours")
    print()
    print(f"AFTER:")
    print(f" Wait 5 min; note new Anthropic spend; subtract from $28.61 baseline.")
    print(f" delta / {summary['successful_episodes']} = per-episode cost")
    print(f" per-episode * {summary['total_corpus_sources']} = full migration estimate")
    print()
    print(f"Full results: {OUT}")


if __name__ == "__main__":
    main()
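The AFTER steps printed by the script above amount to a short extrapolation: spend delta divided by successful episodes, then multiplied by the corpus size. A minimal sketch of that arithmetic follows, assuming the before and after spend figures are read manually from the usage console. The function name `estimate_migration_cost` and its parameter names are hypothetical and are not part of the scripts in this diff.

```python
def estimate_migration_cost(spend_before, spend_after, successful_episodes, total_sources):
    """Extrapolate a full-migration cost from one measured sample run.

    All four inputs are supplied by the caller; nothing is fetched from an API.
    """
    if successful_episodes <= 0:
        raise ValueError("need at least one successful episode to extrapolate")
    delta = spend_after - spend_before          # cost of the sample run
    per_episode = delta / successful_episodes   # mean cost per ingested episode
    return {
        "sample_cost": round(delta, 4),
        "per_episode_cost": round(per_episode, 4),
        "full_migration_cost": round(per_episode * total_sources, 2),
    }


# Illustrative numbers only, not measured results.
example = estimate_migration_cost(10.00, 11.50, 50, 2000)
print(example)
```

Failed episodes are excluded from the divisor here, mirroring the printed instructions; if failed batches still consumed tokens, the per-episode figure will be biased upward.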
@@ -0,0 +1,122 @@
"""
Retest just the previously-failed batches after raising MAX_QUEUED_QUERIES.
Reads failed sources from graphiti_bulk_cost_test.json and resubmits.
"""
import json, os, time
from pathlib import Path
import psycopg2, requests
from dotenv import load_dotenv

load_dotenv(Path.home() / "aaronai" / ".env")

GRAPHITI_URL = "http://localhost:8001"
PG_DSN = os.environ["PG_DSN"]
BATCH_SIZE = 5

PRIOR_RESULTS = Path.home() / "aaronai" / "experiments" / "graphiti_bulk_cost_test.json"
OUT = Path.home() / "aaronai" / "experiments" / "graphiti_bulk_retry.json"


def fetch_doc_for_source(cur, source):
    cur.execute("""
        SELECT STRING_AGG(document, E'\\n\\n' ORDER BY id)
        FROM embeddings WHERE source = %s
    """, (source,))
    row = cur.fetchone()
    return row[0] if row else None


def submit_bulk_batch(batch):
    payload = {"episodes": [
        {"name": s, "content": d[:12000],
         "source_description": "pgvector_migration_bulk_retry",
         "timestamp": "2026-04-28T00:00:00"}
        for s, d in batch
    ]}
    t0 = time.time()
    try:
        r = requests.post(f"{GRAPHITI_URL}/episodes/bulk", json=payload, timeout=900)
        # Measure once so both timing fields describe the same interval.
        elapsed = time.time() - t0
        return {
            "batch_size": len(batch),
            "status_code": r.status_code,
            "elapsed_s": round(elapsed, 2),
            "elapsed_per_episode_s": round(elapsed / len(batch), 2),
            "error": None if r.ok else r.text[:500],
            "sources": [s for s, _ in batch],
        }
    except Exception as e:
        return {
            "batch_size": len(batch),
            "status_code": None,
            "elapsed_s": round(time.time() - t0, 2),
            "elapsed_per_episode_s": None,
            "error": str(e)[:500],
            "sources": [s for s, _ in batch],
        }


def main():
    prior = json.loads(PRIOR_RESULTS.read_text())
    failed_sources = []
    for batch_result in prior["results"]:
        if batch_result["error"] is not None:
            failed_sources.extend(batch_result["sources"])
    print(f"Retrying {len(failed_sources)} previously-failed sources")

    conn = psycopg2.connect(PG_DSN)
    cur = conn.cursor()
    sources_with_docs = []
    for s in failed_sources:
        doc = fetch_doc_for_source(cur, s)
        if doc:
            sources_with_docs.append((s, doc))
        else:
            print(f" WARN: could not find doc for source {s}")
    cur.close(); conn.close()
    print(f"Loaded {len(sources_with_docs)} source docs")
    print()

    batches = [sources_with_docs[i:i+BATCH_SIZE]
               for i in range(0, len(sources_with_docs), BATCH_SIZE)]

    results = []
    total_start = time.time()
    for i, batch in enumerate(batches, start=1):
        avg = int(sum(len(d) for _, d in batch) / len(batch))
        print(f"[batch {i:2d}/{len(batches)}] n={len(batch)} avg_chars={avg:6d}",
              end=" ", flush=True)
        result = submit_bulk_batch(batch)
        results.append(result)
        if result["error"]:
            print(f" ERROR: {result['error'][:80]}")
        else:
            print(f" {result['status_code']} {result['elapsed_s']}s")
    total_elapsed = time.time() - total_start

    successful = [r for r in results if r["error"] is None]
    failed = [r for r in results if r["error"] is not None]
    summary = {
        "n_retry_sources": len(sources_with_docs),
        "n_batches": len(batches),
        "successful_batches": len(successful),
        "failed_batches": len(failed),
        "successful_episodes": sum(r["batch_size"] for r in successful),
        "failed_episodes": sum(r["batch_size"] for r in failed),
        "total_elapsed_s": round(total_elapsed, 1),
        "results": results,
    }
    OUT.write_text(json.dumps(summary, indent=2))

    print()
    print("=" * 60)
    print("RETRY RESULTS")
    print("=" * 60)
    print(f"Episodes: {summary['successful_episodes']}/{len(sources_with_docs)} succeeded")
    print(f"Batches: {summary['successful_batches']}/{summary['n_batches']} succeeded")
    print(f"Total elapsed: {summary['total_elapsed_s']}s")
    print()
    print(f"Full results: {OUT}")


if __name__ == "__main__":
    main()
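The retry scripts both read a prior results file, gather the sources of every batch whose `error` is set, and re-batch them. The two steps can be sketched as small pure functions, which makes them easy to check without a database or a running sidecar. The helper names `collect_failed_sources` and `chunked` are hypothetical; the scripts in this diff inline the same logic.

```python
def collect_failed_sources(results):
    """Return source names from every batch result whose "error" is set."""
    failed = []
    for batch_result in results:
        if batch_result.get("error") is not None:
            failed.extend(batch_result.get("sources", []))
    return failed


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


# Illustrative input shaped like the "results" list the scripts write.
prior_results = [
    {"error": None, "sources": ["a", "b"]},
    {"error": "timeout", "sources": ["c", "d", "e"]},
]
retry_batches = chunked(collect_failed_sources(prior_results), 2)
print(retry_batches)
```

Because a whole batch is marked failed when the request errors, sources that were in fact ingested before the failure may be resubmitted; whether that creates duplicates depends on the sidecar and is not addressed by this sketch.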
@@ -0,0 +1,93 @@
"""Retry attempt #2 — for sources that timed out after MAX_QUEUED_QUERIES bump."""
import json, os, time
from pathlib import Path
import psycopg2, requests
from dotenv import load_dotenv
load_dotenv(Path.home() / "aaronai" / ".env")

GRAPHITI_URL = "http://localhost:8001"
PG_DSN = os.environ["PG_DSN"]
BATCH_SIZE = 3  # smaller batches given timeouts

PRIOR = Path.home() / "aaronai" / "experiments" / "graphiti_bulk_retry.json"
OUT = Path.home() / "aaronai" / "experiments" / "graphiti_bulk_retry2.json"


def fetch_doc(cur, source):
    cur.execute("SELECT STRING_AGG(document, E'\\n\\n' ORDER BY id) FROM embeddings WHERE source = %s", (source,))
    row = cur.fetchone()
    return row[0] if row else None


def submit_batch(batch):
    payload = {"episodes": [
        {"name": s, "content": d[:12000],
         "source_description": "pgvector_migration_bulk_retry2",
         "timestamp": "2026-04-28T00:00:00"}
        for s, d in batch
    ]}
    t0 = time.time()
    try:
        r = requests.post(f"{GRAPHITI_URL}/episodes/bulk", json=payload, timeout=900)
        return {
            "batch_size": len(batch),
            "status_code": r.status_code,
            "elapsed_s": round(time.time() - t0, 2),
            "error": None if r.ok else r.text[:500],
            "sources": [s for s, _ in batch],
        }
    except Exception as e:
        return {
            "batch_size": len(batch),
            "status_code": None,
            "elapsed_s": round(time.time() - t0, 2),
            "error": str(e)[:500],
            "sources": [s for s, _ in batch],
        }


def main():
    prior = json.loads(PRIOR.read_text())
    failed = []
    for r in prior["results"]:
        if r["error"] is not None:
            failed.extend(r["sources"])
    print(f"Retry #2: {len(failed)} sources still failing")

    conn = psycopg2.connect(PG_DSN); cur = conn.cursor()
    sources = []
    for s in failed:
        d = fetch_doc(cur, s)
        if d:
            sources.append((s, d))
    cur.close(); conn.close()

    batches = [sources[i:i+BATCH_SIZE] for i in range(0, len(sources), BATCH_SIZE)]
    print(f"Submitting {len(batches)} batches of up to {BATCH_SIZE}\n")

    results = []
    for i, batch in enumerate(batches, 1):
        avg = int(sum(len(d) for _, d in batch) / len(batch))
        print(f"[batch {i}/{len(batches)}] n={len(batch)} avg_chars={avg:6d}", end=" ", flush=True)
        r = submit_batch(batch)
        results.append(r)
        if r["error"]:
            print(f" ERROR: {r['error'][:80]}")
        else:
            print(f" {r['status_code']} {r['elapsed_s']}s")

    succ = [r for r in results if r["error"] is None]
    fail = [r for r in results if r["error"] is not None]
    summary = {
        "n_sources": len(sources),
        "successful_batches": len(succ),
        "failed_batches": len(fail),
        "successful_episodes": sum(r["batch_size"] for r in succ),
        "failed_episodes": sum(r["batch_size"] for r in fail),
        "results": results,
    }
    OUT.write_text(json.dumps(summary, indent=2))
    print()
    print(f"Episodes: {summary['successful_episodes']}/{len(sources)} succeeded")
    print(f"Full results: {OUT}")


if __name__ == "__main__":
    main()
@@ -0,0 +1,175 @@
"""
Measure actual Graphiti episode-add cost on a stratified sample of pgvector sources.
"""
import json, os, random, time
from pathlib import Path
import psycopg2, requests
from dotenv import load_dotenv

load_dotenv(Path.home() / "aaronai" / ".env")

GRAPHITI_URL = "http://localhost:8001"
PG_DSN = os.environ["PG_DSN"]
SAMPLE_SIZE = 50
RANDOM_SEED = 42

OUT = Path.home() / "aaronai" / "experiments" / "graphiti_cost_test.json"
OUT.parent.mkdir(parents=True, exist_ok=True)


def fetch_stratified_sample():
    conn = psycopg2.connect(PG_DSN)
    cur = conn.cursor()
    cur.execute("""
        SELECT source, STRING_AGG(document, E'\\n\\n' ORDER BY id) AS full_doc
        FROM embeddings
        GROUP BY source
    """)
    sources = [(s, doc) for s, doc in cur.fetchall() if doc]
    cur.close(); conn.close()

    random.seed(RANDOM_SEED)
    short = [(s, d) for s, d in sources if len(d) < 1000]
    medium = [(s, d) for s, d in sources if 1000 <= len(d) < 5000]
    long_ = [(s, d) for s, d in sources if len(d) >= 5000]

    print(f"Pool: short={len(short)} medium={len(medium)} long={len(long_)}")
    sample = (
        random.sample(short, min(15, len(short))) +
        random.sample(medium, min(25, len(medium))) +
        random.sample(long_, min(10, len(long_)))
    )
    print(f"Sample: {len(sample)} sources")
    return sample


def submit_episode(source: str, document: str) -> dict:
    payload = {
        "name": source,
        "content": document[:12000],
        "source_description": "pgvector_migration_cost_test",
        "timestamp": "2026-04-28T00:00:00",
    }
    t0 = time.time()
    try:
        r = requests.post(f"{GRAPHITI_URL}/episodes", json=payload, timeout=600)
        return {
            "source": source,
            "doc_chars": len(document),
            "doc_chars_sent": min(len(document), 12000),
            "status_code": r.status_code,
            "elapsed_s": round(time.time() - t0, 2),
            "error": None if r.ok else r.text[:500],
        }
    except Exception as e:
        return {
            "source": source,
            "doc_chars": len(document),
            "doc_chars_sent": min(len(document), 12000),
            "status_code": None,
            "elapsed_s": round(time.time() - t0, 2),
            "error": str(e)[:500],
        }


def main():
    print("=" * 60)
    print("Graphiti Migration Cost Test (Haiku 4.5)")
    print("=" * 60)
    print()
    print("BEFORE running:")
    print(" 1. Open https://console.anthropic.com/settings/usage")
    print(" 2. Note current spend.")
    print()
    input("Press Enter when noted... ")
    print()

    sample = fetch_stratified_sample()
    if not sample:
        print("ERROR: empty sample"); return

    # Smoke test
    print(f"Smoke test on first source ({sample[0][0][:50]}...):")
    smoke = submit_episode(*sample[0])
    print(f" status={smoke['status_code']} elapsed={smoke['elapsed_s']}s")
    if smoke["error"]:
        print(f" ERROR: {smoke['error']}")
        OUT.write_text(json.dumps({"smoke_test": smoke}, indent=2))
        print("Halted — fix smoke test before bulk run.")
        return
    print(f" OK. Proceeding with {len(sample)} sources.")
    print()

    results = [smoke]
    total_start = time.time()
    for i, (source, doc) in enumerate(sample[1:], start=2):
        bucket = "short" if len(doc) < 1000 else "medium" if len(doc) < 5000 else "long"
        print(f"[{i:2d}/{len(sample)}] [{bucket:6s}] [{len(doc):6d}c] {source[:50]:50s}", end=" ", flush=True)
        result = submit_episode(source, doc)
        results.append(result)
        if result["error"]:
            print(f" ERROR: {result['error'][:80]}")
        else:
            print(f" {result['status_code']} {result['elapsed_s']}s")
    total_elapsed = time.time() - total_start

    successful = [r for r in results if r["error"] is None]
    failed = [r for r in results if r["error"] is not None]

    summary = {
        "sample_size": len(sample),
        "successful": len(successful),
        "failed": len(failed),
        "total_elapsed_s": round(total_elapsed, 1),
        "mean_elapsed_per_episode_s": round(
            sum(r["elapsed_s"] for r in successful) / max(len(successful), 1), 2
        ),
        "by_bucket": {},
        "results": results,
    }

    for bname, lo, hi in [("short", 0, 1000), ("medium", 1000, 5000), ("long", 5000, 10**9)]:
        b = [r for r in successful if lo <= r["doc_chars"] < hi]
        if b:
            summary["by_bucket"][bname] = {
                "n": len(b),
                "mean_elapsed_s": round(sum(r["elapsed_s"] for r in b) / len(b), 2),
                "mean_chars": int(sum(r["doc_chars"] for r in b) / len(b)),
            }

    conn = psycopg2.connect(PG_DSN)
    cur = conn.cursor()
    cur.execute("SELECT COUNT(DISTINCT source) FROM embeddings")
    total_sources = cur.fetchone()[0]
    cur.close(); conn.close()

    summary["total_corpus_sources"] = total_sources
    summary["estimated_migration_hours"] = round(
        total_sources * summary["mean_elapsed_per_episode_s"] / 3600, 1
    )

    OUT.write_text(json.dumps(summary, indent=2))

    print()
    print("=" * 60)
    print("RESULTS")
    print("=" * 60)
    print(f"Sample: {summary['successful']}/{summary['sample_size']} succeeded, {summary['failed']} failed")
    print(f"Total elapsed: {summary['total_elapsed_s']}s")
    print(f"Mean per episode: {summary['mean_elapsed_per_episode_s']}s")
    for bucket, stats in summary["by_bucket"].items():
        print(f" {bucket:6s} n={stats['n']:3d} chars~{stats['mean_chars']:6d} elapsed~{stats['mean_elapsed_s']}s")
    print()
    print(f"Total corpus sources: {summary['total_corpus_sources']}")
    print(f"Estimated migration runtime: {summary['estimated_migration_hours']} hours")
    print()
    print("AFTER:")
    print(" Wait 5 min; note new Anthropic spend; subtract.")
    print(f" test_cost / {summary['successful']} = per-episode cost")
    print(f" per-episode * {summary['total_corpus_sources']} = full migration estimate")
    print()
    print(f"Full results: {OUT}")


if __name__ == "__main__":
    main()
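The stratified sampling used by both cost tests (short under 1000 characters, medium under 5000, long otherwise, with quotas of 15, 25 and 10) can be isolated from the database query. The sketch below is one possible refactor, not the scripts' own code: it uses a local `random.Random(seed)` instance instead of seeding the global generator, and the names `bucket_for` and `stratified_sample` are hypothetical.

```python
import random


def bucket_for(doc):
    """Classify a document by character length, using the scripts' thresholds."""
    n = len(doc)
    if n < 1000:
        return "short"
    if n < 5000:
        return "medium"
    return "long"


def stratified_sample(sources, quotas, seed=42):
    """Draw up to quotas[bucket] (source, doc) pairs from each length bucket."""
    rng = random.Random(seed)  # local RNG: does not disturb global random state
    pools = {name: [] for name in quotas}
    for source, doc in sources:
        pools[bucket_for(doc)].append((source, doc))
    sample = []
    for name, quota in quotas.items():
        pool = pools[name]
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample


# Synthetic corpus for illustration only.
corpus = (
    [(f"s{i}", "x" * 10) for i in range(20)]
    + [(f"m{i}", "x" * 2000) for i in range(5)]
    + [(f"l{i}", "x" * 6000) for i in range(3)]
)
picked = stratified_sample(corpus, {"short": 15, "medium": 25, "long": 10})
print(len(picked))
```

Capping each draw at the pool size matters because `random.sample` raises `ValueError` when asked for more items than the population holds.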
@@ -0,0 +1,155 @@
"""
E1.4 per-source predicate diversity comparison — fixed version.
Looks up episode uuids by name in both production and cascade graphs.
"""
import json
from collections import defaultdict
from falkordb import FalkorDB

E14_RESULTS = "/home/aaron/aaronai/experiments/e14_cascade_results.json"
PRODUCTION_GROUP = "aaron"
CASCADE_GROUP = "aaron_cascade_e14"

def get_predicates_for_episode(graph, episode_uuid):
    query = """
    MATCH ()-[r:RELATES_TO]->()
    WHERE $uuid IN r.episodes
    RETURN count(DISTINCT r.name) AS predicate_count
    """
    result = graph.query(query, {"uuid": episode_uuid})
    rows = result.result_set
    return rows[0][0] if rows else 0

def get_edge_count_for_episode(graph, episode_uuid):
    query = """
    MATCH ()-[r:RELATES_TO]->()
    WHERE $uuid IN r.episodes
    RETURN count(r) AS edge_count
    """
    result = graph.query(query, {"uuid": episode_uuid})
    rows = result.result_set
    return rows[0][0] if rows else 0

def find_episode_uuid(graph, source_name):
    query = """
    MATCH (e:Episodic {name: $name})
    RETURN e.uuid AS uuid
    LIMIT 1
    """
    result = graph.query(query, {"name": source_name})
    rows = result.result_set
    return rows[0][0] if rows else None

def main():
    db = FalkorDB(host='localhost', port=6379)
    prod_graph = db.select_graph(PRODUCTION_GROUP)
    cascade_graph = db.select_graph(CASCADE_GROUP)

    with open(E14_RESULTS) as f:
        e14 = json.load(f)

    sources = [r for r in e14['results'] if 'submit_result' in r]
    print(f"Analyzing {len(sources)} sources...")
    print()

    comparisons = []
    missing_prod = 0
    missing_cascade = 0
    for src in sources:
        name = src['name']
        bucket = src['bucket']

        prod_uuid = find_episode_uuid(prod_graph, name)
        cascade_uuid = find_episode_uuid(cascade_graph, name)

        if not prod_uuid:
            missing_prod += 1
            print(f" WARN: missing in production: {name}")
            continue
        if not cascade_uuid:
            missing_cascade += 1
            print(f" WARN: missing in cascade: {name}")
            continue

        prod_preds = get_predicates_for_episode(prod_graph, prod_uuid)
        cascade_preds = get_predicates_for_episode(cascade_graph, cascade_uuid)
        prod_edges = get_edge_count_for_episode(prod_graph, prod_uuid)
        cascade_edges = get_edge_count_for_episode(cascade_graph, cascade_uuid)

        comparisons.append({
            "name": name,
            "bucket": bucket,
            "prod_preds": prod_preds,
            "cascade_preds": cascade_preds,
            "delta_preds": cascade_preds - prod_preds,
            "prod_edges": prod_edges,
            "cascade_edges": cascade_edges,
            "delta_edges": cascade_edges - prod_edges,
        })

    if missing_prod or missing_cascade:
        print()
        print(f"Missing: {missing_prod} prod, {missing_cascade} cascade")
        print()

    if not comparisons:
        print("No comparable sources found. Aborting.")
        return

    # Per-source detail
    print(f"{'Bucket':<10} {'Source':<58} {'Preds A→B':<14} {'Δ':<6} {'Edges A→B':<14} {'Δ'}")
    print("-" * 115)
    for c in sorted(comparisons, key=lambda x: (x['bucket'], x['name'])):
        name_short = (c['name'][:55] + '..') if len(c['name']) > 58 else c['name']
        preds_str = f"{c['prod_preds']}→{c['cascade_preds']}"
        edges_str = f"{c['prod_edges']}→{c['cascade_edges']}"
        print(f"{c['bucket']:<10} {name_short:<58} {preds_str:<14} {c['delta_preds']:+d} {edges_str:<14} {c['delta_edges']:+d}")

    # Per-bucket aggregation
    print()
    print("=" * 115)
    print("PER-BUCKET AGGREGATION")
    print("=" * 115)
    by_bucket = defaultdict(list)
    for c in comparisons:
        by_bucket[c['bucket']].append(c)

    for bucket in ['high', 'mid', 'low', 'document']:
        items = by_bucket.get(bucket, [])
        if not items:
            continue
        n = len(items)
        sum_pp = sum(c['prod_preds'] for c in items)
        sum_cp = sum(c['cascade_preds'] for c in items)
        sum_pe = sum(c['prod_edges'] for c in items)
        sum_ce = sum(c['cascade_edges'] for c in items)
        positive = sum(1 for c in items if c['delta_preds'] > 0)
        negative = sum(1 for c in items if c['delta_preds'] < 0)
        flat = sum(1 for c in items if c['delta_preds'] == 0)
        pct_pred = ((sum_cp - sum_pp) / sum_pp * 100) if sum_pp else 0
        pct_edge = ((sum_ce - sum_pe) / sum_pe * 100) if sum_pe else 0
        print(f"\n{bucket.upper()} (n={n}):")
        print(f" Predicates: {sum_pp} → {sum_cp} ({pct_pred:+.1f}%)")
        print(f" Edges: {sum_pe} → {sum_ce} ({pct_edge:+.1f}%)")
        print(f" Outcomes: {positive} positive, {flat} flat, {negative} negative")

    # Aggregate
    print()
    print("=" * 115)
    print(f"AGGREGATE (n={len(comparisons)})")
    print("=" * 115)
    total_pp = sum(c['prod_preds'] for c in comparisons)
    total_cp = sum(c['cascade_preds'] for c in comparisons)
    total_pe = sum(c['prod_edges'] for c in comparisons)
    total_ce = sum(c['cascade_edges'] for c in comparisons)
    # Guard against a zero production total, as the per-bucket section does.
    total_pct_pred = ((total_cp - total_pp) / total_pp * 100) if total_pp else 0
    total_pct_edge = ((total_ce - total_pe) / total_pe * 100) if total_pe else 0
    print(f" Predicates: {total_pp} → {total_cp} ({total_pct_pred:+.1f}%)")
    print(f" Edges: {total_pe} → {total_ce} ({total_pct_edge:+.1f}%)")

    out_path = "/home/aaron/aaronai/experiments/e14_per_source_comparison.json"
    with open(out_path, "w") as f:
        json.dump(comparisons, f, indent=2)
    print()
    print(f"Saved to {out_path}")

if __name__ == "__main__":
    main()
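The per-bucket and aggregate sections of the comparison script repeat the same relative-change arithmetic. A minimal sketch of that calculation as reusable functions follows, operating on dicts shaped like the script's `comparisons` entries. The names `pct_change` and `summarize_by_bucket` are hypothetical, and returning `None` for a zero baseline is a choice made here; the script itself reports 0 in the per-bucket case.

```python
from collections import defaultdict


def pct_change(before, after):
    """Relative change in percent, or None when the baseline is zero."""
    if not before:
        return None
    return (after - before) / before * 100


def summarize_by_bucket(comparisons):
    """Sum predicate counts per bucket and attach the relative change."""
    totals = defaultdict(lambda: {"n": 0, "prod_preds": 0, "cascade_preds": 0})
    for c in comparisons:
        t = totals[c["bucket"]]
        t["n"] += 1
        t["prod_preds"] += c["prod_preds"]
        t["cascade_preds"] += c["cascade_preds"]
    return {
        bucket: {**t, "pct_preds": pct_change(t["prod_preds"], t["cascade_preds"])}
        for bucket, t in totals.items()
    }


# Synthetic rows for illustration only, not experiment results.
rows = [
    {"bucket": "high", "prod_preds": 4, "cascade_preds": 6},
    {"bucket": "high", "prod_preds": 6, "cascade_preds": 9},
    {"bucket": "low", "prod_preds": 0, "cascade_preds": 2},
]
print(summarize_by_bucket(rows))
```

Distinguishing "no baseline" from "no change" keeps a bucket with zero production predicates from being reported as flat.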
@@ -0,0 +1,208 @@
#!/usr/bin/env python3
"""E1.4 orchestration — cascade re-extraction at n=30, group_id=aaron_cascade_e14."""
import json
import os
import requests
import time
from pathlib import Path
import psycopg2
from dotenv import load_dotenv

load_dotenv(Path.home() / "aaronai" / ".env")

EXPERIMENTS = Path.home() / "aaronai" / "experiments"
SAMPLE_FILE = EXPERIMENTS / "e14_sample.json"
RESULTS_FILE = EXPERIMENTS / "e14_cascade_results.json"
PG_DSN = os.environ["PG_DSN"]
SIDECAR_URL = "http://localhost:8001"
TEST_GROUP_ID = "aaron_cascade_e14"
MAX_DOC_CHARS = 12000

METADATA_PROMPT = """You are a metadata extraction system. Given a document, produce structural and content metadata in strict JSON format.

Do not summarize the content beyond the one-sentence summary field. Do not extract entities or relationships. Do not interpret meaning. Produce only the metadata schema below.

Output JSON only. No prose, no explanation, no markdown code fences.

Schema:
{
"language": "<ISO 639-1 code>",
"char_length": <integer>,
"primary_format": "<prose|slides|code|structured|mixed>",
"structural_signals": {
"has_headings": <boolean>,
"has_bullet_lists": <boolean>,
"has_numbered_lists": <boolean>,
"has_tables": <boolean>,
"has_code_blocks": <boolean>,
"has_dates": <boolean>
},
"content_signals": {
"has_named_people": <boolean>,
"has_institutional_language": <boolean>,
"has_technical_terminology": <boolean>,
"has_first_person": <boolean>,
"has_quotations": <boolean>
},
"domain_class": "<technical|administrative|educational|personal|conversational>",
"one_sentence_summary": "<one sentence describing what the document is about>"
}

Document:
"""


def get_pg():
    return psycopg2.connect(PG_DSN)


def fetch_source_text(source):
    conn = get_pg()
    cur = conn.cursor()
    # Backslashes are doubled so PostgreSQL receives the E'\n\n' escape itself,
    # matching the same query in the other experiment scripts.
    cur.execute("""
        SELECT STRING_AGG(document, E'\\n\\n' ORDER BY id) AS full_doc
        FROM embeddings WHERE source = %s
    """, (source,))
    row = cur.fetchone()
    conn.close()
    if row is None or row[0] is None:
        return None
    return row[0]


def run_mistral_metadata(text, max_retries=2):
    truncated = text[:MAX_DOC_CHARS]
    prompt = METADATA_PROMPT + truncated
    last_err = None
    for attempt in range(max_retries):
        try:
            response = requests.post(
                "http://localhost:11434/api/generate",
                json={"model": "mistral:latest", "prompt": prompt, "stream": False, "format": "json"},
                timeout=300,
            )
            response.raise_for_status()
            raw = response.json()["response"]
            try:
                metadata = json.loads(raw)
                metadata["char_length"] = len(truncated)
                return metadata
            except json.JSONDecodeError:
                return {"error": "JSON parse failed", "raw": raw[:500]}
        except (requests.exceptions.ReadTimeout, requests.exceptions.ConnectionError) as e:
            last_err = e
            if attempt < max_retries - 1:
                print(f" (retry {attempt+1} after {type(e).__name__})", end=" ", flush=True)
                time.sleep(5)
                continue
    return {"error": f"After {max_retries} retries: {last_err}"}


def format_metadata_as_orientation(metadata):
    if "error" in metadata:
        return None
    summary = metadata.get("one_sentence_summary", "")
    domain = metadata.get("domain_class", "unknown")
    fmt = metadata.get("primary_format", "unknown")
    return (
        f"This is a {domain} document in {fmt} format. "
        f"Summary: {summary} "
        f"This metadata is provided to orient your extraction, not to constrain it. "
        f"Extract entities and relationships freely from the document text itself; "
        f"the metadata is descriptive context, not a checklist."
    )


def submit_episode_singular(name, content, custom_instructions):
    payload = {
        "name": name,
        "content": content[:MAX_DOC_CHARS],
        "source_description": "e14_replication_run",
        "timestamp": "2026-04-29T00:00:00",
        "group_id": TEST_GROUP_ID,
        "custom_extraction_instructions": custom_instructions,
    }
    response = requests.post(f"{SIDECAR_URL}/episodes", json=payload, timeout=300)
    response.raise_for_status()
    return response.json()


def load_state():
    if RESULTS_FILE.exists():
        with open(RESULTS_FILE) as f:
            data = json.load(f)
        return data.get("results", []), {r["name"] for r in data.get("results", []) if "submit_result" in r}
    return [], set()


def main():
    with open(SAMPLE_FILE) as f:
        sample = json.load(f)
    selected = sample["selected"]

    results, completed = load_state()
    if completed:
        print(f"Resuming — {len(completed)} sources already completed, {len(selected) - len(completed)} remaining\n")
    else:
        print(f"E1.4 cascade replication — {len(selected)} episodes to group_id={TEST_GROUP_ID}\n")

    for i, ep in enumerate(selected, 1):
        name = ep["name"]
        bucket = ep["bucket"]
        if name in completed:
            print(f"[{i}/{len(selected)}] [{bucket}] {name} — SKIP (already completed)")
            continue

        print(f"[{i}/{len(selected)}] [{bucket}] {name}")
        record = {"name": name, "bucket": bucket, "tier1_entities": ep["entities"]}
        if ep.get("subtype"):
            record["subtype"] = ep["subtype"]

        print(f" Fetching source text...", end=" ", flush=True)
        text = fetch_source_text(name)
        if text is None:
            print("FAILED — no chunks in pgvector")
            record["error"] = "no source text"
            results.append(record)
            with open(RESULTS_FILE, "w") as f:
                json.dump({"results": results}, f, indent=2, default=str)
            continue
        record["doc_chars"] = len(text)
        print(f"{len(text)} chars")

print(f" Generating Mistral metadata...", end=" ", flush=True)
|
||||
t0 = time.time()
|
||||
metadata = run_mistral_metadata(text)
|
||||
elapsed = time.time() - t0
|
||||
record["metadata"] = metadata
|
||||
record["metadata_elapsed_s"] = round(elapsed, 1)
|
||||
if "error" in metadata:
|
||||
print(f"FAILED in {elapsed:.1f}s")
|
||||
else:
|
||||
print(f"{elapsed:.1f}s — domain={metadata.get('domain_class')}, format={metadata.get('primary_format')}")
|
||||
|
||||
custom_instructions = format_metadata_as_orientation(metadata)
|
||||
record["custom_extraction_instructions"] = custom_instructions
|
||||
print(f" Submitting via /episodes...", end=" ", flush=True)
|
||||
t0 = time.time()
|
||||
try:
|
||||
result = submit_episode_singular(name, text, custom_instructions)
|
||||
elapsed = time.time() - t0
|
||||
print(f"{elapsed:.1f}s — OK")
|
||||
record["submit_elapsed_s"] = round(elapsed, 1)
|
||||
record["submit_result"] = result
|
||||
except Exception as e:
|
||||
elapsed = time.time() - t0
|
||||
print(f"{elapsed:.1f}s — FAILED: {e}")
|
||||
record["submit_error"] = str(e)
|
||||
|
||||
results.append(record)
|
||||
with open(RESULTS_FILE, "w") as f:
|
||||
json.dump({"results": results}, f, indent=2, default=str)
|
||||
print()
|
||||
|
||||
print(f"\nDone. Results saved to {RESULTS_FILE}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
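The run loop in this script rewrites the results file after every episode so that an interrupted run can resume. A minimal sketch of a more defensive checkpoint write follows, assuming two hypothetical helpers (`save_results_atomic` and `load_completed`) that are not part of the original script. The idea is to write to a temporary file in the same directory and swap it into place with `os.replace`, so a crash in the middle of a write should not leave a truncated JSON file behind.

```python
import json
import os
import tempfile
from pathlib import Path


def save_results_atomic(path, results):
    """Write {"results": [...]} to `path` without exposing a partial file.

    Hypothetical helper: the original script calls json.dump directly on
    RESULTS_FILE; this sketch only illustrates an alternative.
    """
    path = Path(path)
    # The temp file lives next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"results": results}, f, indent=2, default=str)
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if serialization or the swap failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_completed(path):
    """Return (results, names that already have a submit_result)."""
    path = Path(path)
    if not path.exists():
        return [], set()
    with open(path) as f:
        results = json.load(f).get("results", [])
    return results, {r["name"] for r in results if "submit_result" in r}
```

As in the script, only records carrying a `submit_result` count as completed, so failed sources are retried on the next run.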
@@ -0,0 +1,160 @@
|
||||
#!/usr/bin/env python3
|
||||
"""E1.4 sample selection — n=30 stratified, excluding E1's 10 sources."""
|
||||
import json
|
||||
import re
|
||||
import subprocess
|
||||
from pathlib import Path
|
||||
|
||||
EXPERIMENTS = Path.home() / "aaronai" / "experiments"
|
||||
E1_SAMPLE_FILE = EXPERIMENTS / "cascade_reextract_sample.json"
|
||||
OUTPUT = EXPERIMENTS / "e14_sample.json"
|
||||
|
||||
TARGETS = {"high": 8, "mid": 8, "low": 8, "document": 6}
|
||||
|
||||
def query_episode_counts():
|
||||
query = ("MATCH (e:Episodic) OPTIONAL MATCH (e)-[r]-(n:Entity) "
|
||||
"RETURN e.name AS name, count(distinct n) AS entities "
|
||||
"ORDER BY entities DESC")
|
||||
result = subprocess.run(
|
||||
["docker", "exec", "falkordb", "redis-cli", "GRAPH.QUERY", "aaron", query],
|
||||
capture_output=True, text=True
|
||||
)
|
||||
lines = [line for line in result.stdout.split("\n") if line.strip()]
|
||||
episodes = []
|
||||
i = 0
|
||||
while i < len(lines):
|
||||
if lines[i] == "name":
|
||||
i += 2
|
||||
continue
|
||||
if lines[i].startswith("Cached") or lines[i].startswith("Query"):
|
||||
break
|
||||
if i + 1 < len(lines):
|
||||
try:
|
||||
count = int(lines[i + 1])
|
||||
episodes.append({"name": lines[i], "entities": count})
|
||||
i += 2
|
||||
except ValueError:
|
||||
i += 1
|
||||
else:
|
||||
i += 1
|
||||
return episodes
|
||||
|
||||
|
||||
def is_document(name):
|
||||
return any(name.lower().endswith(ext) for ext in (".pdf", ".docx", ".pptx", ".txt", ".md"))
|
||||
|
||||
|
||||
def doc_subtype(name):
|
||||
"""Categorize document by likely subtype."""
|
||||
s = name.lower()
|
||||
if "syllabus" in s or "ind study" in s or "_is" in s:
|
||||
return "academic"
|
||||
if "annual" in s or "report" in s or "_ar20" in s or "rtpcc" in s or "novo" in s:
|
||||
return "reference"
|
||||
if "cv" in s or "resume" in s or "application" in s or "cover letter" in s:
|
||||
return "reference"
|
||||
if "marquee" in s or "pptx" in s or "presentation" in s:
|
||||
return "creative"
|
||||
return "other"
|
||||
|
||||
|
||||
def main():
|
||||
print("Fetching episode entity counts from Tier 1 graph...")
|
||||
episodes = query_episode_counts()
|
||||
print(f"Got {len(episodes)} episodes")
|
||||
|
||||
# Load E1's sample to exclude
|
||||
with open(E1_SAMPLE_FILE) as f:
|
||||
e1_sample = json.load(f)
|
||||
e1_names = {ep["name"] for ep in e1_sample["selected"]}
|
||||
print(f"Excluding {len(e1_names)} sources from E1")
|
||||
|
||||
# Quartile boundaries
|
||||
counts = sorted([e["entities"] for e in episodes], reverse=True)
|
||||
n = len(counts)
|
||||
top_q = counts[n // 4]
|
||||
bottom_q = counts[3 * n // 4]
|
||||
print(f"Quartile boundaries: top≥{top_q}, mid={bottom_q+1}-{top_q-1}, low≤{bottom_q}")
|
||||
|
||||
# Filter out E1 and bucket
|
||||
available = [e for e in episodes if e["name"] not in e1_names]
|
||||
|
||||
high = [e for e in available if e["entities"] >= top_q and not is_document(e["name"])]
|
||||
mid = [e for e in available if bottom_q < e["entities"] < top_q and not is_document(e["name"])]
|
||||
low = [e for e in available if e["entities"] <= bottom_q and not is_document(e["name"])]
|
||||
docs = [e for e in available if is_document(e["name"]) and e["entities"] >= 5]
|
||||
|
||||
print(f"\nAvailable after E1 exclusion:")
|
||||
print(f" High-density: {len(high)}")
|
||||
print(f" Mid-density: {len(mid)}")
|
||||
print(f" Low-density: {len(low)}")
|
||||
print(f" Documents: {len(docs)}")
|
||||
|
||||
# For high/mid/low: take from middle of bucket (avoids edge cases)
|
||||
def pick(bucket, n):
|
||||
if len(bucket) < n:
|
||||
print(f" WARNING: only {len(bucket)} available, asked for {n}")
|
||||
return bucket
|
||||
mid_idx = len(bucket) // 2
|
||||
start = max(0, mid_idx - n // 2)
|
||||
return bucket[start:start + n]
|
||||
|
||||
selected = []
|
||||
for ep in pick(high, TARGETS["high"]):
|
||||
ep["bucket"] = "high"
|
||||
selected.append(ep)
|
||||
for ep in pick(mid, TARGETS["mid"]):
|
||||
ep["bucket"] = "mid"
|
||||
selected.append(ep)
|
||||
for ep in pick(low, TARGETS["low"]):
|
||||
ep["bucket"] = "low"
|
||||
selected.append(ep)
|
||||
|
||||
# For documents: stratify by subtype, target 2 academic, 2 creative, 2 reference
|
||||
doc_targets = {"academic": 2, "creative": 2, "reference": 2}
|
||||
docs_by_subtype = {}
|
||||
for ep in docs:
|
||||
st = doc_subtype(ep["name"])
|
||||
ep["subtype"] = st
|
||||
docs_by_subtype.setdefault(st, []).append(ep)
|
||||
print(f"\n Doc subtypes available: {[(k, len(v)) for k, v in docs_by_subtype.items()]}")
|
||||
|
||||
# Pick from middle of each subtype bucket
|
||||
for subtype, target in doc_targets.items():
|
||||
sub_docs = docs_by_subtype.get(subtype, [])
|
||||
picked = pick(sub_docs, target)
|
||||
for ep in picked:
|
||||
ep["bucket"] = "document"
|
||||
selected.append(ep)
|
||||
|
||||
# If we're short on documents (e.g., subtype underrepresented), fill from "other"
|
||||
doc_count = sum(1 for s in selected if s.get("bucket") == "document")
|
||||
if doc_count < TARGETS["document"]:
|
||||
shortage = TARGETS["document"] - doc_count
|
||||
leftover = [e for e in docs if e["name"] not in {s["name"] for s in selected}]
|
||||
for ep in leftover[:shortage]:
|
||||
ep["bucket"] = "document"
|
||||
ep["subtype"] = ep.get("subtype") or doc_subtype(ep["name"])
|
||||
selected.append(ep)
|
||||
|
||||
print(f"\nSelected {len(selected)} episodes for E1.4:")
|
||||
for ep in selected:
|
||||
sub = f"/{ep.get('subtype')}" if ep.get('bucket') == 'document' else ""
|
||||
print(f" [{ep['bucket']}{sub:>10}] {ep['entities']:>3}e {ep['name']}")
|
||||
|
||||
with open(OUTPUT, "w") as f:
|
||||
json.dump({
|
||||
"metadata": {
|
||||
"purpose": "E1.4 cascade re-extraction replication (n=30)",
|
||||
"exclusions": "E1's 10 sources",
|
||||
"stratification": {**TARGETS, "document_subtypes": doc_targets},
|
||||
"quartile_top": top_q,
|
||||
"quartile_bottom": bottom_q,
|
||||
},
|
||||
"selected": selected,
|
||||
}, f, indent=2)
|
||||
print(f"\nSaved to {OUTPUT}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
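The selection script buckets episodes by quartile boundaries on entity count and then takes items from the middle of each bucket. A small standalone sketch of those two steps is shown here, using the same index arithmetic as the script. The function names `quartile_bounds` and `pick_middle` and the sample counts are illustrative assumptions, not names or data from the repository.

```python
def quartile_bounds(counts):
    """Return (top_q, bottom_q) using the script's rule:
    sort descending, then read positions n//4 and 3n//4."""
    ordered = sorted(counts, reverse=True)
    n = len(ordered)
    if n == 0:
        raise ValueError("no episodes to bucket")
    return ordered[n // 4], ordered[3 * n // 4]


def pick_middle(bucket, k):
    """Take k consecutive items centred on the middle of an ordered bucket."""
    if len(bucket) < k:
        # Not enough items: return everything, as the script does.
        return list(bucket)
    mid_idx = len(bucket) // 2
    start = max(0, mid_idx - k // 2)
    return bucket[start:start + k]


counts = [80, 70, 60, 50, 40, 30, 20, 10]   # made-up entity counts
top_q, bottom_q = quartile_bounds(counts)
print(top_q, bottom_q)                      # 60 20
print(pick_middle(list(range(10)), 4))      # [3, 4, 5, 6]
```

Because the buckets inherit the descending order of the graph query, taking the middle slice avoids both the densest and the sparsest episodes in each bucket.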
@@ -0,0 +1,246 @@
|
||||
"""
|
||||
E1.6 analysis — correlate domain-purity ratings with cascade outcomes.
|
||||
Applies pre-registered decision rules from E1.6 protocol.
|
||||
"""
|
||||
import json
|
||||
from collections import defaultdict
|
||||
|
||||
RATINGS_PATH = "/home/aaron/aaronai/experiments/e16_purity_ratings.json"
|
||||
COMPARISON_PATH = "/home/aaron/aaronai/experiments/e14_per_source_comparison.json"
|
||||
|
||||
|
||||
def spearman(xs, ys):
|
||||
"""Compute Spearman rank correlation."""
|
||||
n = len(xs)
|
||||
if n < 2:
|
||||
return None
|
||||
# Rank the values
|
||||
def rank(values):
|
||||
sorted_idx = sorted(range(len(values)), key=lambda i: values[i])
|
||||
ranks = [0] * len(values)
|
||||
i = 0
|
||||
while i < len(values):
|
||||
j = i
|
||||
while j + 1 < len(values) and values[sorted_idx[j+1]] == values[sorted_idx[i]]:
|
||||
j += 1
|
||||
avg_rank = (i + j) / 2 + 1
|
||||
for k in range(i, j + 1):
|
||||
ranks[sorted_idx[k]] = avg_rank
|
||||
i = j + 1
|
||||
return ranks
|
||||
rx = rank(xs)
|
||||
ry = rank(ys)
|
||||
mean_rx = sum(rx) / n
|
||||
mean_ry = sum(ry) / n
|
||||
num = sum((rx[i] - mean_rx) * (ry[i] - mean_ry) for i in range(n))
|
||||
den_x = (sum((rx[i] - mean_rx) ** 2 for i in range(n))) ** 0.5
|
||||
den_y = (sum((ry[i] - mean_ry) ** 2 for i in range(n))) ** 0.5
|
||||
if den_x == 0 or den_y == 0:
|
||||
return None
|
||||
return num / (den_x * den_y)
|
||||
|
||||
|
||||
def main():
|
||||
with open(RATINGS_PATH) as f:
|
||||
ratings_data = json.load(f)
|
||||
with open(COMPARISON_PATH) as f:
|
||||
comparisons = json.load(f)
|
||||
|
||||
ratings_by_name = {r['name']: r for r in ratings_data['ratings']}
|
||||
comp_by_name = {c['name']: c for c in comparisons}
|
||||
|
||||
# Join ratings with cascade outcomes
|
||||
joined = []
|
||||
for name, rating in ratings_by_name.items():
|
||||
if name in comp_by_name:
|
||||
comp = comp_by_name[name]
|
||||
joined.append({
|
||||
'name': name,
|
||||
'binary': rating['binary'],
|
||||
'score': rating['score'],
|
||||
'note': rating.get('note'),
|
||||
'bucket': comp['bucket'],
|
||||
'delta_preds': comp['delta_preds'],
|
||||
'delta_edges': comp['delta_edges'],
|
||||
'prod_preds': comp['prod_preds'],
|
||||
'cascade_preds': comp['cascade_preds'],
|
||||
})
|
||||
|
||||
print("=" * 100)
|
||||
print(f"E1.6 ANALYSIS — Domain Purity vs Cascade Outcome (n={len(joined)})")
|
||||
print("=" * 100)
|
||||
|
||||
# Per-source detail with rating
|
||||
print()
|
||||
print(f"{'Bucket':<10} {'Source':<48} {'Domain':<8} {'Score':<6} {'Δpreds':<8} {'Δedges':<8}")
|
||||
print("-" * 100)
|
||||
for j in sorted(joined, key=lambda x: (x['binary'], -x['score'], x['bucket'], x['name'])):
|
||||
name_short = (j['name'][:45] + '..') if len(j['name']) > 48 else j['name']
|
||||
print(f"{j['bucket']:<10} {name_short:<48} {j['binary']:<8} {j['score']:<6} {j['delta_preds']:+d} {j['delta_edges']:+d}")
|
||||
|
||||
# PRIMARY TEST: binary purity vs cascade outcome distribution
|
||||
print()
|
||||
print("=" * 100)
|
||||
print("PRIMARY TEST: Binary purity vs cascade outcome distribution")
|
||||
print("=" * 100)
|
||||
|
||||
def categorize_outcome(delta):
|
||||
if delta > 0:
|
||||
return 'positive'
|
||||
elif delta < 0:
|
||||
return 'negative'
|
||||
else:
|
||||
return 'flat'
|
||||
|
||||
by_binary = defaultdict(lambda: {'positive': 0, 'flat': 0, 'negative': 0, 'total': 0})
|
||||
for j in joined:
|
||||
outcome = categorize_outcome(j['delta_preds'])
|
||||
by_binary[j['binary']][outcome] += 1
|
||||
by_binary[j['binary']]['total'] += 1
|
||||
|
||||
print()
|
||||
print(f"{'Group':<15} {'n':<5} {'Positive':<12} {'Flat':<10} {'Negative':<12}")
|
||||
print("-" * 60)
|
||||
for binary in ['single', 'multi']:
|
||||
d = by_binary[binary]
|
||||
n = d['total']
|
||||
if n == 0:
|
||||
continue
|
||||
pos_pct = d['positive'] / n * 100
|
||||
flat_pct = d['flat'] / n * 100
|
||||
neg_pct = d['negative'] / n * 100
|
||||
print(f"{binary+'-domain':<15} {n:<5} {d['positive']} ({pos_pct:.0f}%) {d['flat']} ({flat_pct:.0f}%) {d['negative']} ({neg_pct:.0f}%)")
|
||||
|
||||
# Compute the gap
|
||||
if by_binary['single']['total'] > 0 and by_binary['multi']['total'] > 0:
|
||||
single_pos_rate = by_binary['single']['positive'] / by_binary['single']['total'] * 100
|
||||
multi_pos_rate = by_binary['multi']['positive'] / by_binary['multi']['total'] * 100
|
||||
gap = single_pos_rate - multi_pos_rate
|
||||
print()
|
||||
print(f"Cascade-positive rate gap (single - multi): {gap:+.1f} percentage points")
|
||||
print()
|
||||
# Apply pre-registered decision rule
|
||||
if gap >= 20:
|
||||
verdict = "NARROWNESS HYPOTHESIS SUPPORTED"
|
||||
detail = f"Single-domain content is {gap:.0f}pp more likely to gain from cascade than multi-domain."
|
||||
elif gap <= -20:
|
||||
verdict = "REVERSE OF HYPOTHESIS"
|
||||
detail = "Multi-domain content unexpectedly benefits more (counter to prediction)."
|
||||
elif abs(gap) < 10:
|
||||
verdict = "HYPOTHESIS NOT SUPPORTED"
|
||||
detail = "Domain purity does not appear to predict cascade outcome."
|
||||
else:
|
||||
verdict = "INCONCLUSIVE"
|
||||
detail = f"Gap of {gap:+.0f}pp is suggestive but below the pre-registered 20pp threshold."
|
||||
print(f" Pre-registered decision rule: {verdict}")
|
||||
print(f" {detail}")
|
||||
|
||||
# SECONDARY TEST: Spearman correlation between purity score and predicate delta
|
||||
print()
|
||||
print("=" * 100)
|
||||
print("SECONDARY TEST: Spearman rank correlation (purity score vs predicate delta)")
|
||||
print("=" * 100)
|
||||
|
||||
scores = [j['score'] for j in joined]
|
||||
deltas_pred = [j['delta_preds'] for j in joined]
|
||||
deltas_edge = [j['delta_edges'] for j in joined]
|
||||
|
||||
rho_pred = spearman(scores, deltas_pred)
|
||||
rho_edge = spearman(scores, deltas_edge)
|
||||
|
||||
print()
|
||||
print(f" Spearman ρ (purity score vs Δpredicates): {'n/a' if rho_pred is None else format(rho_pred, '.3f')}")
|
||||
print(f" Spearman ρ (purity score vs Δedges): {'n/a' if rho_edge is None else format(rho_edge, '.3f')}")
|
||||
print()
|
||||
|
||||
if rho_pred is not None:
|
||||
if rho_pred >= 0.4:
|
||||
v = "STRONG POSITIVE — narrowness hypothesis supported with monotonic relationship"
|
||||
elif rho_pred >= 0.2:
|
||||
v = "WEAK POSITIVE — consistent with hypothesis but not strong evidence"
|
||||
elif rho_pred <= -0.2:
|
||||
v = "NEGATIVE — refutes hypothesis"
|
||||
else:
|
||||
v = "NO CORRELATION — hypothesis not supported"
|
||||
print(f" Predicate delta verdict: {v}")
|
||||
print()
|
||||
|
||||
# TERTIARY TEST: within-bucket correlation
|
||||
print()
|
||||
print("=" * 100)
|
||||
print("TERTIARY TEST: Within-bucket correlation")
|
||||
print("=" * 100)
|
||||
|
||||
by_bucket = defaultdict(list)
|
||||
for j in joined:
|
||||
by_bucket[j['bucket']].append(j)
|
||||
|
||||
print()
|
||||
print(f"{'Bucket':<12} {'n':<5} {'Single':<10} {'Multi':<10} {'ρ (score vs Δpred)':<22}")
|
||||
print("-" * 75)
|
||||
for bucket in ['high', 'mid', 'low', 'document']:
|
||||
items = by_bucket.get(bucket, [])
|
||||
if not items:
|
||||
continue
|
||||
n = len(items)
|
||||
n_single = sum(1 for j in items if j['binary'] == 'single')
|
||||
n_multi = sum(1 for j in items if j['binary'] == 'multi')
|
||||
if n >= 3:
|
||||
scores_b = [j['score'] for j in items]
|
||||
deltas_b = [j['delta_preds'] for j in items]
|
||||
rho_b = spearman(scores_b, deltas_b)
|
||||
rho_str = f"{rho_b:+.3f}" if rho_b is not None else "n/a (no variance)"
|
||||
else:
|
||||
rho_str = "n/a (too few)"
|
||||
print(f"{bucket:<12} {n:<5} {n_single:<10} {n_multi:<10} {rho_str}")
|
||||
|
||||
# Interaction with bucket: do single/multi outcomes differ within bucket?
|
||||
print()
|
||||
print("Per-bucket cascade-positive rate by binary purity:")
|
||||
print()
|
||||
print(f"{'Bucket':<12} {'Single':<25} {'Multi':<25}")
|
||||
print("-" * 65)
|
||||
for bucket in ['high', 'mid', 'low', 'document']:
|
||||
items = by_bucket.get(bucket, [])
|
||||
if not items:
|
||||
continue
|
||||
single_items = [j for j in items if j['binary'] == 'single']
|
||||
multi_items = [j for j in items if j['binary'] == 'multi']
|
||||
def rate_str(group):
|
||||
if not group:
|
||||
return "—"
|
||||
pos = sum(1 for j in group if j['delta_preds'] > 0)
|
||||
return f"{pos}/{len(group)} positive ({pos/len(group)*100:.0f}%)"
|
||||
print(f"{bucket:<12} {rate_str(single_items):<25} {rate_str(multi_items):<25}")
|
||||
|
||||
# MEAN DELTA by binary group
|
||||
print()
|
||||
print("=" * 100)
|
||||
print("MEAN PREDICATE DELTA BY GROUP")
|
||||
print("=" * 100)
|
||||
print()
|
||||
for binary in ['single', 'multi']:
|
||||
items = [j for j in joined if j['binary'] == binary]
|
||||
if not items:
|
||||
continue
|
||||
n = len(items)
|
||||
mean_dp = sum(j['delta_preds'] for j in items) / n
|
||||
mean_de = sum(j['delta_edges'] for j in items) / n
|
||||
sum_pp = sum(j['prod_preds'] for j in items)
|
||||
sum_cp = sum(j['cascade_preds'] for j in items)
|
||||
pct_change = (sum_cp - sum_pp) / sum_pp * 100 if sum_pp else 0
|
||||
print(f"{binary}-domain (n={n}):")
|
||||
print(f" Mean Δpredicates per source: {mean_dp:+.2f}")
|
||||
print(f" Mean Δedges per source: {mean_de:+.2f}")
|
||||
print(f" Aggregate predicate change: {sum_pp} → {sum_cp} ({pct_change:+.1f}%)")
|
||||
print()
|
||||
|
||||
# Save joined data for the experiments log writeup
|
||||
out_path = "/home/aaron/aaronai/experiments/e16_joined_analysis.json"
|
||||
with open(out_path, "w") as f:
|
||||
json.dump(joined, f, indent=2)
|
||||
print(f"Joined data saved to {out_path}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
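The analysis script computes Spearman's rank correlation by hand, averaging ranks across ties and returning `None` when either variable has no variance. The sketch below restates that approach in a self-contained form so the tie handling and the `None` case can be checked in isolation. The helper name `average_ranks` and the example inputs are assumptions made for illustration.

```python
def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Pearson correlation of the average ranks; None if undefined."""
    n = len(xs)
    if n < 2 or n != len(ys):
        return None
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den_x = sum((a - mx) ** 2 for a in rx) ** 0.5
    den_y = sum((b - my) ** 2 for b in ry) ** 0.5
    if den_x == 0 or den_y == 0:
        return None
    return num / (den_x * den_y)


print(average_ranks([10, 20, 20, 30]))           # [1.0, 2.5, 2.5, 4.0]
rho = spearman([5, 5, 5], [1, 2, 3])             # constant scores: no variance
print("n/a" if rho is None else f"{rho:+.3f}")   # n/a
```

Any caller that formats the result with `:.3f` needs the `None` check shown in the last line, since a constant purity score within a group makes the correlation undefined.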
@@ -0,0 +1,206 @@
|
||||
"""
|
||||
E1.6 domain-purity rating interface — with full metadata context.
|
||||
"""
|
||||
import json
|
||||
import os
|
||||
import random
|
||||
|
||||
E14_RESULTS = "/home/aaron/aaronai/experiments/e14_cascade_results.json"
|
||||
RATINGS_OUT = "/home/aaron/aaronai/experiments/e16_purity_ratings.json"
|
||||
|
||||
INTRO = """
|
||||
================================================================================
|
||||
E1.6 — DOMAIN-PURITY RATING
|
||||
================================================================================
|
||||
|
||||
Two ratings per source:
|
||||
|
||||
1. BINARY — single-domain (s) or multi-domain (m)?
|
||||
|
||||
Mental test: "If Mistral had to pick ONE domain class for this source,
|
||||
would picking just one significantly UNDER-DESCRIBE the content?"
|
||||
|
||||
YES → MULTI-DOMAIN (m) — content lives across two+ frames meaningfully
|
||||
NO → SINGLE-DOMAIN (s) — content fits cleanly within one frame
|
||||
|
||||
2. SCORE (1-5) — how cleanly does it fit?
|
||||
|
||||
5 = unambiguously one domain
|
||||
4 = primarily one domain, slight other element
|
||||
3 = balanced two-domain
|
||||
2 = primarily two-domain with traces of a third
|
||||
1 = three or more domain frames weighted significantly
|
||||
|
||||
Single binary usually = score 4-5
|
||||
Multi binary usually = score 1-3
|
||||
|
||||
You see for each source: name, length, AND the full Mistral metadata block
|
||||
(domain_class, primary_format, structural_signals, content_signals, summary).
|
||||
|
||||
Blind to: bucket assignment, cascade outcome.
|
||||
|
||||
Commands at any prompt: 's', 'm', 'skip', 'quit'
|
||||
================================================================================
|
||||
""".strip()
|
||||
|
||||
|
||||
def load_existing():
|
||||
if os.path.exists(RATINGS_OUT):
|
||||
with open(RATINGS_OUT) as f:
|
||||
return json.load(f)
|
||||
return {"ratings": [], "completed_names": []}
|
||||
|
||||
def save(data):
|
||||
with open(RATINGS_OUT, "w") as f:
|
||||
json.dump(data, f, indent=2)
|
||||
|
||||
def render_metadata(metadata):
|
||||
"""Pretty-print the full Mistral metadata block."""
|
||||
if not isinstance(metadata, dict):
|
||||
print(" (metadata unavailable)")
|
||||
return
|
||||
if 'error' in metadata:
|
||||
print(f" (metadata error: {metadata['error']})")
|
||||
return
|
||||
|
||||
# Render fields in a stable order
|
||||
field_order = [
|
||||
'domain_class',
|
||||
'primary_format',
|
||||
'structural_signals',
|
||||
'content_signals',
|
||||
'summary',
|
||||
]
|
||||
for field in field_order:
|
||||
if field in metadata:
|
||||
value = metadata[field]
|
||||
label = field.replace('_', ' ').title()
|
||||
if isinstance(value, list):
|
||||
if value:
|
||||
print(f" {label}:")
|
||||
for item in value:
|
||||
print(f" - {item}")
|
||||
else:
|
||||
print(f" {label}: (none)")
|
||||
elif isinstance(value, str):
|
||||
# Wrap long strings
|
||||
if len(value) > 70:
|
||||
print(f" {label}:")
|
||||
print(f" {value}")
|
||||
else:
|
||||
print(f" {label}: {value}")
|
||||
else:
|
||||
print(f" {label}: {value}")
|
||||
|
||||
# Show any other fields not in the standard order
|
||||
other_fields = [k for k in metadata.keys() if k not in field_order and k != 'char_length']
|
||||
for field in other_fields:
|
||||
value = metadata[field]
|
||||
label = field.replace('_', ' ').title()
|
||||
print(f" {label}: {value}")
|
||||
|
||||
def render_source(src, idx, total):
|
||||
print()
|
||||
print("=" * 80)
|
||||
print(f" Source {idx}/{total}")
|
||||
print("=" * 80)
|
||||
print(f"Name: {src['name']}")
|
||||
print(f"Length: {src['doc_chars']:,} chars")
|
||||
print()
|
||||
print("Mistral metadata:")
|
||||
print()
|
||||
render_metadata(src.get('metadata', {}))
|
||||
print()
|
||||
print("-" * 80)
|
||||
|
||||
def get_rating():
|
||||
while True:
|
||||
binary = input("Single-domain or multi-domain? [s/m/skip/quit]: ").strip().lower()
|
||||
if binary in ('s', 'm', 'skip', 'quit'):
|
||||
break
|
||||
print(" Please enter 's', 'm', 'skip', or 'quit'")
|
||||
|
||||
if binary == 'quit':
|
||||
return 'quit'
|
||||
if binary == 'skip':
|
||||
return None
|
||||
|
||||
while True:
|
||||
try:
|
||||
score_input = input("Purity score (1=many frames, 5=clearly single): ").strip()
|
||||
if score_input.lower() == 'quit':
|
||||
return 'quit'
|
||||
score = int(score_input)
|
||||
if 1 <= score <= 5:
|
||||
break
|
||||
print(" Score must be 1-5")
|
||||
except ValueError:
|
||||
print(" Please enter a number 1-5 (or 'quit')")
|
||||
|
||||
note = input("Optional note (Enter to skip): ").strip()
|
||||
|
||||
return {
|
||||
"binary": "single" if binary == 's' else "multi",
|
||||
"score": score,
|
||||
"note": note if note else None,
|
||||
}
|
||||
|
||||
def main():
|
||||
with open(E14_RESULTS) as f:
|
||||
e14 = json.load(f)
|
||||
|
||||
sources = [r for r in e14['results'] if 'submit_result' in r]
|
||||
rng = random.Random(42)
|
||||
shuffled = list(sources)
|
||||
rng.shuffle(shuffled)
|
||||
|
||||
state = load_existing()
|
||||
completed = set(state['completed_names'])
|
||||
remaining = [s for s in shuffled if s['name'] not in completed]
|
||||
|
||||
print(INTRO)
|
||||
print()
|
||||
print(f"Total sources: {len(sources)}")
|
||||
print(f"Already rated: {len(completed)}")
|
||||
print(f"Remaining: {len(remaining)}")
|
||||
print()
|
||||
if not remaining:
|
||||
print("All sources rated. Run analysis script next.")
|
||||
return
|
||||
|
||||
input("Press Enter to begin...")
|
||||
|
||||
try:
|
||||
for i, src in enumerate(remaining, start=len(completed) + 1):
|
||||
render_source(src, i, len(sources))
|
||||
try:
|
||||
rating = get_rating()
|
||||
except (KeyboardInterrupt, EOFError):
|
||||
print("\n\nSaving and exiting...")
|
||||
save(state)
|
||||
return
|
||||
|
||||
if rating == 'quit':
|
||||
print("\nSaving and exiting...")
|
||||
save(state)
|
||||
return
|
||||
if rating is None:
|
||||
print(" Skipped")
|
||||
continue
|
||||
|
||||
rating['name'] = src['name']
|
||||
state['ratings'].append(rating)
|
||||
state['completed_names'].append(src['name'])
|
||||
save(state)
|
||||
print(f" Recorded: {rating['binary']}-domain, score={rating['score']}")
|
||||
|
||||
print()
|
||||
print("=" * 80)
|
||||
print(f"Done. Rated {len(state['ratings'])} sources.")
|
||||
print(f"Saved to {RATINGS_OUT}")
|
||||
except (KeyboardInterrupt, EOFError):
|
||||
print("\n\nSaving...")
|
||||
save(state)
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
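The rating interface shuffles sources with a fixed seed and filters out names that were already rated, which is what lets a rater quit and pick up at the same place later. A minimal sketch of that resume behaviour follows; `rating_order`, `remaining_after`, and the source names are hypothetical and exist only for this illustration.

```python
import random


def rating_order(names, seed=42):
    """Return names in a shuffled order that is identical on every run."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def remaining_after(names, completed, seed=42):
    """Names still to rate, in the same fixed order, skipping completed ones."""
    done = set(completed)
    return [n for n in rating_order(names, seed) if n not in done]


names = [f"source_{i}" for i in range(6)]    # hypothetical source names
first_session = rating_order(names)
rated = first_session[:2]                    # pretend two were rated, then quit
second_session = remaining_after(names, rated)
print(second_session == first_session[2:])   # True
```

The order is only stable if the input list arrives in the same order each time, so this relies on the results file not being reordered between sessions.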
@@ -0,0 +1,190 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
E1.8 Phase 2 — Evaluate
|
||||
Pulls predicate counts from FalkorDB for each group_id and compares.
|
||||
Run after e1_8_taxfree_cascade.py completes.
|
||||
"""
|
||||
|
||||
import json, subprocess
|
||||
from pathlib import Path
|
||||
|
||||
RESULTS_PATH = Path.home() / "aaronai" / "experiments" / "e1_8_results.json"
|
||||
EVAL_PATH = Path.home() / "aaronai" / "experiments" / "e1_8_eval.json"
|
||||
|
||||
GROUP_TAXFREE = "aaron_e18_taxfree"
|
||||
GROUP_BASELINE = "aaron_e18_baseline"
|
||||
GROUP_STANDARD = "aaron_e18_standard"
|
||||
GROUP_PROD = "aaron"
|
||||
GROUP_E14 = "aaron_cascade_e14"
|
||||
|
||||
|
||||
def query(group_id, cypher):
|
||||
result = subprocess.run(
|
||||
["docker", "exec", "falkordb", "redis-cli", "GRAPH.QUERY", group_id, cypher],
|
||||
capture_output=True, text=True
|
||||
)
|
||||
return result.stdout
|
||||
|
||||
|
||||
def get_episode_uuid(group_id, episode_name):
|
||||
safe = episode_name.replace("\\", "\\\\").replace("'", "\\'")
|
||||
cypher = f"MATCH (e:Episodic) WHERE e.name = '{safe}' RETURN e.uuid LIMIT 1"
|
||||
output = query(group_id, cypher)
|
||||
for line in output.split("\n"):
|
||||
line = line.strip()
|
||||
if len(line) == 36 and line.count("-") == 4:
|
||||
return line
|
||||
return None
|
||||
|
||||
|
||||
def count_preds(group_id, uuid):
|
||||
cypher = f"MATCH ()-[r:RELATES_TO]->() WHERE '{uuid}' IN r.episodes RETURN count(distinct r.name) AS p"
|
||||
output = query(group_id, cypher)
|
||||
for line in output.split("\n"):
|
||||
line = line.strip()
|
||||
if line.isdigit():
|
||||
return int(line)
|
||||
return 0
|
||||
|
||||
|
||||
def count_edges(group_id, uuid):
|
||||
cypher = f"MATCH ()-[r:RELATES_TO]->() WHERE '{uuid}' IN r.episodes RETURN count(r) AS n"
|
||||
output = query(group_id, cypher)
|
||||
for line in output.split("\n"):
|
||||
line = line.strip()
|
||||
if line.isdigit():
|
||||
return int(line)
|
||||
return 0
|
||||
|
||||
|
||||
def eval_source(name, groups):
|
||||
result = {"name": name}
|
||||
for label, group_id in groups.items():
|
||||
uuid = get_episode_uuid(group_id, name)
|
||||
if uuid:
|
||||
result[f"{label}_preds"] = count_preds(group_id, uuid)
|
||||
result[f"{label}_edges"] = count_edges(group_id, uuid)
|
||||
else:
|
||||
result[f"{label}_preds"] = None
|
||||
result[f"{label}_edges"] = None
|
||||
return result
|
||||
|
||||
|
||||
def run():
|
||||
print("E1.8 — Evaluation phase")
|
||||
print("=" * 60)
|
||||
|
||||
results = json.loads(RESULTS_PATH.read_text())
|
||||
eval_results = {"subsample_a": [], "subsample_b": []}
|
||||
|
||||
# Sub-sample A — compare taxfree vs prod (baseline) vs e14 cascade
|
||||
print("\nSub-sample A")
|
||||
print(f"{'Source':<55} {'prod':>5} {'e14c':>5} {'tf':>5} {'e14Δ':>6} {'tfΔ':>6}")
|
||||
print("-" * 90)
|
||||
|
||||
a_records = []
|
||||
for item in results["subsample_a"]:
|
||||
name = item["name"]
|
||||
r = eval_source(name, {
|
||||
"prod": GROUP_PROD,
|
||||
"e14": GROUP_E14,
|
||||
"tf": GROUP_TAXFREE,
|
||||
})
|
||||
r["bucket"] = item["bucket"]
|
||||
r["taxfree_metadata"] = item.get("taxfree_metadata")
|
||||
r["e14_delta_preds"] = item.get("e14_delta_preds")
|
||||
|
||||
prod = r.get("prod_preds") or 0
|
||||
e14 = r.get("e14_preds") or 0
|
||||
tf = r.get("tf_preds") or 0
|
||||
e14_delta = ((e14 - prod) / prod * 100) if prod > 0 else 0
|
||||
tf_delta = ((tf - prod) / prod * 100) if prod > 0 else 0
|
||||
|
||||
display = name[:53] + ".." if len(name) > 55 else name
|
||||
print(f"{display:<55} {prod:>5} {e14:>5} {tf:>5} {e14_delta:>+5.0f}% {tf_delta:>+5.0f}%")
|
||||
|
||||
r["tf_delta_vs_prod"] = tf_delta
|
||||
r["e14_delta_vs_prod"] = e14_delta
|
||||
a_records.append(r)
|
||||
eval_results["subsample_a"].append(r)
|
||||
|
||||
# Aggregate Sub-sample A
|
||||
valid = [r for r in a_records if r.get("prod_preds") and r.get("tf_preds")]
|
||||
if valid:
|
||||
mean_e14_delta = sum(r["e14_delta_vs_prod"] for r in valid) / len(valid)
|
||||
mean_tf_delta = sum(r["tf_delta_vs_prod"] for r in valid) / len(valid)
|
||||
print(f"\nAggregate Sub-sample A (n={len(valid)}):")
|
||||
print(f" E1.4 cascade mean delta vs prod: {mean_e14_delta:+.1f}%")
|
||||
print(f" Taxonomy-free mean delta vs prod: {mean_tf_delta:+.1f}%")
|
||||
print(f" Taxonomy-free vs E1.4 cascade: {mean_tf_delta - mean_e14_delta:+.1f}pp")
|
||||
|
||||
# Sub-sample B — all three conditions
|
||||
print("\n\nSub-sample B")
|
||||
print(f"{'Source':<55} {'base':>5} {'std':>5} {'tf':>5} {'stdΔ':>6} {'tfΔ':>6}")
|
||||
print("-" * 90)
|
||||
|
||||
b_records = []
|
||||
for item in results["subsample_b"]:
|
||||
name = item["name"]
|
||||
r = eval_source(name, {
|
||||
"base": GROUP_BASELINE,
|
||||
"std": GROUP_STANDARD,
|
||||
"tf": GROUP_TAXFREE,
|
||||
})
|
||||
r["bucket"] = item["bucket"]
|
||||
r["taxfree_metadata"] = item.get("taxfree_metadata")
|
||||
r["standard_metadata"] = item.get("standard_metadata")
|
||||
|
||||
base = r.get("base_preds") or 0
|
||||
std = r.get("std_preds") or 0
|
||||
tf = r.get("tf_preds") or 0
|
||||
std_delta = ((std - base) / base * 100) if base > 0 else 0
|
||||
tf_delta = ((tf - base) / base * 100) if base > 0 else 0
|
||||
|
||||
display = name[:53] + ".." if len(name) > 55 else name
|
||||
print(f"{display:<55} {base:>5} {std:>5} {tf:>5} {std_delta:>+5.0f}% {tf_delta:>+5.0f}%")
|
||||
|
||||
r["std_delta_vs_base"] = std_delta
|
||||
r["tf_delta_vs_base"] = tf_delta
|
||||
b_records.append(r)
|
||||
eval_results["subsample_b"].append(r)
|
||||
|
||||
# Aggregate Sub-sample B
|
||||
valid_b = [r for r in b_records if r.get("base_preds") and r.get("tf_preds")]
|
||||
if valid_b:
|
||||
mean_std_delta = sum(r["std_delta_vs_base"] for r in valid_b) / len(valid_b)
|
||||
mean_tf_delta = sum(r["tf_delta_vs_base"] for r in valid_b) / len(valid_b)
|
||||
print(f"\nAggregate Sub-sample B (n={len(valid_b)}):")
|
||||
print(f" Standard cascade mean delta vs baseline: {mean_std_delta:+.1f}%")
|
||||
print(f" Taxonomy-free mean delta vs baseline: {mean_tf_delta:+.1f}%")
|
||||
|
||||
# By bucket
|
||||
print("\nPer-bucket (Sub-sample B):")
|
||||
for bucket in ["high", "mid", "document"]:
|
||||
br = [r for r in valid_b if r["bucket"] == bucket]
|
||||
if not br:
|
||||
continue
|
||||
m_std = sum(r["std_delta_vs_base"] for r in br) / len(br)
|
||||
m_tf = sum(r["tf_delta_vs_base"] for r in br) / len(br)
|
||||
print(f" [{bucket:>8}] n={len(br)} std={m_std:+.0f}% tf={m_tf:+.0f}%")
|
||||
|
||||
# Decision rule evaluation
|
||||
print("\n" + "=" * 60)
|
||||
print("DECISION RULE:")
|
||||
if valid:
|
||||
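improvement = sum(r["tf_delta_vs_prod"] - r["e14_delta_vs_prod"] for r in valid) / len(valid)  # Sub-sample A only; mean_tf_delta is reassigned by the Sub-sample B aggregate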
|
||||
if improvement >= 20:
|
||||
print(f" ✓ STRONG RECOVERY (+{improvement:.1f}pp) — Stage 3.1 ships as taxonomy-free")
|
||||
elif improvement >= 5:
|
||||
print(f" ~ PARTIAL RECOVERY (+{improvement:.1f}pp) — orientation helps, needs refinement")
|
||||
elif improvement >= 0:
|
||||
print(f" ~ MARGINAL (+{improvement:.1f}pp) — consider API extractor prompt redesign (E1.9)")
|
||||
else:
|
||||
print(f" ✗ NEGATIVE ({improvement:.1f}pp) — taxonomy-free introduces more noise than standard")
|
||||
|
||||
EVAL_PATH.write_text(json.dumps(eval_results, indent=2))
|
||||
print(f"\nEval saved to {EVAL_PATH}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
run()
|
||||
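The evaluation script builds Cypher queries by interpolating episode names into single-quoted string literals, so names containing quotes or backslashes need escaping before they reach `GRAPH.QUERY`. The sketch below shows one way to do that, assuming the graph's Cypher dialect accepts backslash escapes inside string literals as openCypher describes. The helper names `cypher_string_literal` and `episode_uuid_query` are hypothetical.

```python
def cypher_string_literal(value):
    """Return `value` as a single-quoted Cypher string literal.

    Backslashes are escaped first so the escapes added for quotes are not
    themselves doubled. Assumes backslash escapes are accepted by the dialect.
    """
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def episode_uuid_query(episode_name):
    """Build the lookup query that finds an episode's uuid by name."""
    literal = cypher_string_literal(episode_name)
    return f"MATCH (e:Episodic) WHERE e.name = {literal} RETURN e.uuid LIMIT 1"


print(episode_uuid_query("Aaron's notes.pdf"))
# MATCH (e:Episodic) WHERE e.name = 'Aaron\'s notes.pdf' RETURN e.uuid LIMIT 1
```

Note that in Python source `"\'"` is just a single quote, so an escape written that way leaves the name unchanged; the replacement string has to contain a real backslash, written `"\\'"`.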
@@ -0,0 +1,285 @@
|
||||
#!/usr/bin/env python3
"""
E1.8 Phase 1 — Ingest
Runs taxonomy-free and standard cascade ingestion for Sub-samples A and B.
Run this first, then run e1_8_eval.py to pull predicate counts.
"""

import json
import os
import time
from pathlib import Path

import psycopg2
import requests
from dotenv import load_dotenv

load_dotenv(Path.home() / "aaronai" / ".env", override=True)

PG_DSN = os.getenv("PG_DSN")
GRAPHITI_URL = "http://localhost:8001"
RESULTS_PATH = Path.home() / "aaronai" / "experiments" / "e1_8_results.json"

GROUP_TAXFREE = "aaron_e18_taxfree"
GROUP_BASELINE = "aaron_e18_baseline"
GROUP_STANDARD = "aaron_e18_standard"
|
||||
|
||||
TAXFREE_PROMPT = """You are a metadata extraction system. Given a document, describe its content shape for use as orientation context in a knowledge graph extraction pass.
|
||||
|
||||
Do not summarize content. Do not extract entities. Do not assign a single category label.
|
||||
|
||||
Instead, describe:
|
||||
- What domains or frames are active in this content (there may be several simultaneously)
|
||||
- How those frames relate to each other in this specific document
|
||||
- What kind of relational content a knowledge graph extractor should look for
|
||||
|
||||
Output JSON only. No prose, no explanation, no markdown.
|
||||
|
||||
Schema:
|
||||
{
|
||||
"active_frames": ["<frame 1>", "<frame 2>", ...],
|
||||
"frame_relationships": "<one sentence describing how the frames interact in this document>",
|
||||
"extraction_orientation": "<one sentence orienting the extractor toward the most relationship-rich content>",
|
||||
"one_sentence_summary": "<one sentence describing what the document is about>"
|
||||
}
|
||||
|
||||
Document:
|
||||
"""
|
||||
|
||||
STANDARD_PROMPT = """You are a metadata extraction system. Given a document, produce structural and content metadata in strict JSON format.
|
||||
|
||||
Do not summarize the content beyond the one-sentence summary field. Do not extract entities or relationships. Do not interpret meaning. Produce only the metadata schema below.
|
||||
|
||||
Output JSON only. No prose, no explanation, no markdown code fences.
|
||||
|
||||
Schema:
|
||||
{
|
||||
"language": "<ISO 639-1 code>",
|
||||
"char_length": <integer>,
|
||||
"primary_format": "<prose|slides|code|structured|mixed>",
|
||||
"structural_signals": {
|
||||
"has_headings": <boolean>,
|
||||
"has_bullet_lists": <boolean>,
|
||||
"has_numbered_lists": <boolean>,
|
||||
"has_tables": <boolean>,
|
||||
"has_code_blocks": <boolean>,
|
||||
"has_dates": <boolean>
|
||||
},
|
||||
"content_signals": {
|
||||
"has_named_people": <boolean>,
|
||||
"has_institutional_language": <boolean>,
|
||||
"has_technical_terminology": <boolean>,
|
||||
"has_first_person": <boolean>,
|
||||
"has_quotations": <boolean>
|
||||
},
|
||||
"domain_class": "<technical|administrative|educational|personal|conversational>",
|
||||
"one_sentence_summary": "<one sentence describing what the document is about>"
|
||||
}
|
||||
|
||||
Document:
|
||||
"""
|
||||
|
||||
SUBSAMPLE_A = [
    {"name": "Claude: Lubbock on everything album lyrics", "bucket": "high"},
    {"name": "ChatGPT: Tulsa Concept Album Guide", "bucket": "high"},
    {"name": "ChatGPT: Rhino 3D object flow", "bucket": "high"},
    {"name": "Claude: SUNY faculty conflict of interest policies", "bucket": "mid"},
    {"name": "Claude: Interview presentation research and preparation", "bucket": "mid"},
    {"name": "Claude: Research Statement Restructure", "bucket": "mid"},
    {"name": "ChatGPT: Respect Individual Interests for Christmas", "bucket": "low"},
    {"name": "University of North Texas Cover letter.pdf", "bucket": "document"},
    {"name": "Claude: Finding ideal rural housing near University of Utah", "bucket": "high"},
    {"name": "ChatGPT: SEC coaches with OSU ties", "bucket": "high"},
    {"name": "Claude: Bonding ASA 3D printed parts", "bucket": "mid"},
    {"name": "ChatGPT: Title: User request summary.", "bucket": "low"},
    {"name": "ChatGPT: Scholarship Recommendation Letter Tips", "bucket": "low"},
]
|
||||
|
||||
SUBSAMPLE_B = [
    {"name": "ChatGPT: Job application comparison", "bucket": "high"},
    {"name": "ChatGPT: External review for tenure", "bucket": "high"},
    {"name": "Claude: University of Utah interview teaching example", "bucket": "high"},
    {"name": "ChatGPT: Starting Dropship Gun Business", "bucket": "high"},
    {"name": "ChatGPT: Analyze business plan", "bucket": "high"},
    {"name": "ChatGPT: Outdoor Layering Explained", "bucket": "mid"},
    {"name": "ChatGPT: Limits in Calculus.", "bucket": "mid"},
    {"name": "ChatGPT: Academic Program Director Role", "bucket": "mid"},
    {"name": "ChatGPT: Lonely Island Poop Skit", "bucket": "mid"},
    {"name": "ChatGPT: Parse Tidal playlist", "bucket": "mid"},
    {"name": "NO thesis proposal.pdf", "bucket": "document"},
    {"name": "PWM.pdf", "bucket": "document"},
    {"name": "Will_It_Print.pdf", "bucket": "document"},
    {"name": "Kim Kedem Ind Study F2025 Syllabus.docx", "bucket": "document"},
    {"name": "Aaron Nelson Graduate Transcript.pdf", "bucket": "document"},
]
|
||||
|
||||
|
||||
def get_pg():
    return psycopg2.connect(PG_DSN)


def get_document_text(source_name):
    """Join the first 20 chunks of a source and cap the text at 12,000 characters."""
    pg = get_pg()
    try:
        cur = pg.cursor()
        cur.execute("SELECT document FROM embeddings WHERE source = %s ORDER BY id LIMIT 20", (source_name,))
        rows = cur.fetchall()
    finally:
        # Close the connection even if the query raises
        pg.close()
    return " ".join(r[0] for r in rows)[:12000]
|
||||
|
||||
|
||||
def run_mistral(prompt_prefix, doc_text, label=""):
    """Run one metadata prompt through local Mistral (Ollama) and return the parsed JSON."""
    print(f" → Mistral {label} running...", flush=True)
    payload = {"model": "mistral:latest", "prompt": prompt_prefix + doc_text, "stream": False, "format": "json"}
    resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
    resp.raise_for_status()
    raw = resp.json().get("response", "{}")
    print(f" → Mistral {label} done ({len(raw)} chars)", flush=True)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Keep a short prefix of the raw output for debugging
        return {"error": "parse_failed", "raw": raw[:200]}
|
||||
|
||||
|
||||
def build_taxfree_orientation(meta):
    frames = ", ".join(meta.get("active_frames", []))
    rel = meta.get("frame_relationships", "")
    orient = meta.get("extraction_orientation", "")
    summary = meta.get("one_sentence_summary", "")
    return f"Active frames: {frames}. Frame relationships: {rel} Extraction focus: {orient} Summary: {summary}"


def build_standard_orientation(meta):
    dc = meta.get("domain_class", "unknown")
    pf = meta.get("primary_format", "unknown")
    summary = meta.get("one_sentence_summary", "")
    cs = meta.get("content_signals", {})
    return (f"domain_class: {dc}\nprimary_format: {pf}\none_sentence_summary: {summary}\n"
            f"has_named_people: {cs.get('has_named_people', False)}\n"
            f"has_technical_terminology: {cs.get('has_technical_terminology', False)}")
|
||||
|
||||
|
||||
def ingest(source_name, doc_text, orientation, group_id):
    payload = {
        "episodes": [{
            "name": source_name,
            "content": doc_text[:12000],
            "source_description": orientation,
            "timestamp": "2026-04-28T00:00:00",
        }],
        "group_id": group_id,
    }
    resp = requests.post(f"{GRAPHITI_URL}/episodes/bulk", json=payload, timeout=300)
    resp.raise_for_status()


def save(results):
    RESULTS_PATH.write_text(json.dumps(results, indent=2))
|
||||
|
||||
|
||||
def run():
    print("E1.8 — Ingest phase")
    print("=" * 60)

    # Load existing results if resuming
    if RESULTS_PATH.exists():
        results = json.loads(RESULTS_PATH.read_text())
        done_a = {r["name"] for r in results.get("subsample_a", [])}
        done_b = {r["name"] for r in results.get("subsample_b", [])}
        print(f"Resuming: {len(done_a)} A done, {len(done_b)} B done")
    else:
        results = {"subsample_a": [], "subsample_b": []}
        done_a, done_b = set(), set()

    e14_data = json.loads((Path.home() / "aaronai" / "experiments" / "e14_per_source_comparison.json").read_text())
    e14_by_name = {s["name"]: s for s in e14_data}
|
||||
|
||||
    # Sub-sample A — taxonomy-free only (baseline + standard from E1.4)
    print("\nSub-sample A — taxonomy-free ingestion only")
    for item in SUBSAMPLE_A:
        name = item["name"]
        if name in done_a:
            print(f" SKIP (done): {name}")
            continue
        print(f"\n {name}")
        doc_text = get_document_text(name)
        if not doc_text:
            print(f" SKIP — no text")
            continue

        tf_meta = run_mistral(TAXFREE_PROMPT, doc_text, "taxfree")
        print(f" frames: {tf_meta.get('active_frames', 'ERROR')}")
        orientation = build_taxfree_orientation(tf_meta)

        try:
            ingest(name, doc_text, orientation, GROUP_TAXFREE)
            time.sleep(3)
            print(f" ingested to {GROUP_TAXFREE}")
        except Exception as e:
            print(f" ingest failed: {e}")
            continue

        e14 = e14_by_name.get(name, {})
        results["subsample_a"].append({
            "name": name,
            "bucket": item["bucket"],
            "taxfree_metadata": tf_meta,
            "taxfree_orientation": orientation,
            "e14_prod_preds": e14.get("prod_preds"),
            "e14_cascade_preds": e14.get("cascade_preds"),
            "e14_delta_preds": e14.get("delta_preds"),
            "e14_prod_edges": e14.get("prod_edges"),
            "e14_cascade_edges": e14.get("cascade_edges"),
            "e14_delta_edges": e14.get("delta_edges"),
        })
        save(results)
|
||||
|
||||
    # Sub-sample B — all three conditions
    # Note: a failed ingest is only logged; the entry is still saved below,
    # so a resumed run skips this source instead of retrying the failed condition.
    print("\nSub-sample B — all three conditions")
    for item in SUBSAMPLE_B:
        name = item["name"]
        if name in done_b:
            print(f" SKIP (done): {name}")
            continue
        print(f"\n {name} ({item['bucket']})")
        doc_text = get_document_text(name)
        if not doc_text:
            print(f" SKIP — no text")
            continue

        entry = {"name": name, "bucket": item["bucket"],
                 "taxfree_metadata": None, "standard_metadata": None}

        # Baseline
        try:
            ingest(name, doc_text, "", GROUP_BASELINE)
            time.sleep(3)
            print(f" baseline ingested")
        except Exception as e:
            print(f" baseline failed: {e}")

        # Standard
        std_meta = run_mistral(STANDARD_PROMPT, doc_text, "standard")
        entry["standard_metadata"] = std_meta
        try:
            ingest(name, doc_text, build_standard_orientation(std_meta), GROUP_STANDARD)
            time.sleep(3)
            print(f" standard ingested, domain_class={std_meta.get('domain_class','?')}")
        except Exception as e:
            print(f" standard failed: {e}")

        # Taxonomy-free
        tf_meta = run_mistral(TAXFREE_PROMPT, doc_text, "taxfree")
        entry["taxfree_metadata"] = tf_meta
        print(f" frames: {tf_meta.get('active_frames', 'ERROR')}")
        try:
            ingest(name, doc_text, build_taxfree_orientation(tf_meta), GROUP_TAXFREE)
            time.sleep(3)
            print(f" taxfree ingested")
        except Exception as e:
            print(f" taxfree failed: {e}")

        results["subsample_b"].append(entry)
        save(results)
|
||||
|
||||
    print("\n" + "=" * 60)
    print(f"Ingest complete. Results at {RESULTS_PATH}")
    print("Now run: python3 ~/aaronai/scripts/experiments/e1_8_eval.py")


if __name__ == "__main__":
    run()
|
||||
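The taxonomy-free orientation string in the ingest script is built directly from whatever the model returned, so a parse failure or a bare string in `active_frames` flows straight into `source_description` (joining a plain string with `", ".join(...)` splits it character by character). A minimal defensive sketch follows; `normalize_frames` and `build_orientation` are hypothetical helper names, and the decision to drop empty fields entirely is an assumption rather than the script's behaviour.

```python
def normalize_frames(value):
    """Coerce a model-produced `active_frames` value into a clean list of strings."""
    if value is None:
        return []
    if not isinstance(value, (list, tuple)):
        # A bare string (or other scalar) becomes a one-item list
        # instead of being joined character by character.
        value = [value]
    return [str(v).strip() for v in value if v is not None and str(v).strip()]


def build_orientation(meta):
    """Build an orientation string, skipping fields that are missing or empty."""
    parts = []
    frames = normalize_frames(meta.get("active_frames"))
    if frames:
        parts.append("Active frames: " + ", ".join(frames) + ".")
    labelled = (
        ("Frame relationships", "frame_relationships"),
        ("Extraction focus", "extraction_orientation"),
        ("Summary", "one_sentence_summary"),
    )
    for label, key in labelled:
        text = meta.get(key)
        if isinstance(text, str) and text.strip():
            parts.append(f"{label}: {text.strip()}")
    return " ".join(parts)
```

With this shape, a `{"error": "parse_failed", ...}` record yields an empty orientation, which makes that run equivalent to the baseline condition rather than sending an orientation made only of empty labels.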
@@ -0,0 +1,204 @@
|
||||
#!/usr/bin/env python3
"""
E1.9 Phase 1 — Retroactive validation
For each E1.8 source, query the production graph with frame_relationships
to get a coverage score, then check whether the routing tier prediction
matches the actual best-performing condition from E1.8.
No API spend required — uses existing E1.8 data and Graphiti search only.
"""

import json
from pathlib import Path

import requests

GRAPHITI_URL = "http://localhost:8001"
E18_PATH = Path.home() / "aaronai" / "experiments" / "e1_8_eval.json"
E18_INGEST_PATH = Path.home() / "aaronai" / "experiments" / "e1_8_results.json"
RESULTS_PATH = Path.home() / "aaronai" / "experiments" / "e1_9_retroactive.json"

# Routing thresholds
HIGH_THRESHOLD = 0.70  # coverage >= 0.70 routes to baseline
LOW_THRESHOLD = 0.40   # coverage < 0.40 routes to taxonomy-free; in between routes to standard
|
||||
|
||||
|
||||
def get_coverage_score(query, group_id="aaron"):
    """Query production graph and return coverage score based on result count.
    Score: 0 = no results, 0.33 = 1 result, 0.67 = 2 results, 1.0 = 3+ results.
    Uses result count because Graphiti fulltext search returns score=0 for all hits.
    """
    if not query or not query.strip():
        return 0.0
    try:
        resp = requests.get(
            f"{GRAPHITI_URL}/search",
            params={"query": query, "limit": 3, "group_id": group_id},
            timeout=30
        )
        resp.raise_for_status()
        results = resp.json().get("results", [])
        n = len(results)
        return min(n / 3.0, 1.0)
    except Exception as e:
        # A failed search is scored the same as an empty result set
        print(f" Search error: {e}")
        return 0.0


def assign_tier(coverage_score):
    if coverage_score >= HIGH_THRESHOLD:
        return "baseline"
    elif coverage_score >= LOW_THRESHOLD:
        return "standard"
    else:
        return "taxfree"
|
||||
|
||||
|
||||
def best_condition_from_e18(record, subsample):
    """
    Determine which condition actually performed best for this source in E1.8.
    Sub-sample A: compare prod (baseline), e14 (standard cascade), tf (taxfree)
    Sub-sample B: compare base, std, tf
    Ties resolve in the order taxfree, standard, baseline.
    """
    if subsample == "a":
        prod = record.get("prod_preds") or 0
        e14 = record.get("e14_preds") or 0
        tf = record.get("tf_preds") or 0
        best_score = max(prod, e14, tf)
        if best_score == 0:
            return "unknown"
        if tf == best_score:
            return "taxfree"
        elif e14 == best_score:
            return "standard"
        else:
            return "baseline"
    else:
        base = record.get("base_preds") or 0
        std = record.get("std_preds") or 0
        tf = record.get("tf_preds") or 0
        best_score = max(base, std, tf)
        if best_score == 0:
            return "unknown"
        if tf == best_score:
            return "taxfree"
        elif std == best_score:
            return "standard"
        else:
            return "baseline"
|
||||
|
||||
|
||||
def run():
    print("E1.9 Phase 1 — Retroactive validation")
    print("=" * 60)

    e18_eval = json.loads(E18_PATH.read_text())
    e18_ingest = json.loads(E18_INGEST_PATH.read_text())

    # Build frame_relationships lookup from ingest results
    fr_lookup = {}
    for item in e18_ingest.get("subsample_a", []):
        meta = item.get("taxfree_metadata", {})
        if meta:
            fr_lookup[item["name"]] = meta.get("frame_relationships", "")
    for item in e18_ingest.get("subsample_b", []):
        meta = item.get("taxfree_metadata", {})
        if meta:
            fr_lookup[item["name"]] = meta.get("frame_relationships", "")
|
||||
|
||||
    results = []
    correct = 0
    total = 0

    # Sub-sample A
    print("\nSub-sample A")
    print(f"{'Source':<50} {'cov':>5} {'tier':<10} {'predicted':<10} {'actual':<10} {'match'}")
    print("-" * 95)

    for record in e18_eval["subsample_a"]:
        name = record["name"]
        fr = fr_lookup.get(name, "")
        coverage = get_coverage_score(fr)
        tier = assign_tier(coverage)
        actual_best = best_condition_from_e18(record, "a")
        match = "✓" if tier == actual_best else "✗"
        if actual_best != "unknown":
            total += 1
            if tier == actual_best:
                correct += 1
        display = name[:48] + ".." if len(name) > 50 else name
        print(f"{display:<50} {coverage:>5.2f} {tier:<10} {tier:<10} {actual_best:<10} {match}")
        results.append({
            "name": name, "subsample": "a", "bucket": record.get("bucket"),
            "frame_relationships": fr, "coverage_score": coverage,
            "predicted_tier": tier, "actual_best": actual_best, "match": tier == actual_best,
        })
|
||||
|
||||
    # Sub-sample B
    print("\nSub-sample B")
    print(f"{'Source':<50} {'cov':>5} {'tier':<10} {'predicted':<10} {'actual':<10} {'match'}")
    print("-" * 95)

    for record in e18_eval["subsample_b"]:
        name = record["name"]
        fr = fr_lookup.get(name, "")
        coverage = get_coverage_score(fr)
        tier = assign_tier(coverage)
        actual_best = best_condition_from_e18(record, "b")
        match = "✓" if tier == actual_best else "✗"
        if actual_best != "unknown":
            total += 1
            if tier == actual_best:
                correct += 1
        display = name[:48] + ".." if len(name) > 50 else name
        print(f"{display:<50} {coverage:>5.2f} {tier:<10} {tier:<10} {actual_best:<10} {match}")
        results.append({
            "name": name, "subsample": "b", "bucket": record.get("bucket"),
            "frame_relationships": fr, "coverage_score": coverage,
            "predicted_tier": tier, "actual_best": actual_best, "match": tier == actual_best,
        })
|
||||
|
||||
    # Summary
    rate = correct / total * 100 if total > 0 else 0
    print(f"\n{'=' * 60}")
    print(f"Validation rate: {correct}/{total} ({rate:.1f}%)")
    print()
    if rate >= 70:
        print("✓ SIGNAL VALIDATED — coverage score predicts best condition")
        print(" Proceed to Phase 2 (new ingestion with routing)")
    elif rate >= 50:
        print("~ MARGINAL — adjust thresholds before Phase 2")
        print(" Review mismatch patterns below")
    else:
        print("✗ SIGNAL NOT PREDICTIVE — frame_relationships coverage")
        print(" may not be the right signal. Consider active_frames fallback.")

    # Mismatch analysis
    mismatches = [r for r in results if not r["match"] and r["actual_best"] != "unknown"]
    if mismatches:
        print(f"\nMismatches ({len(mismatches)}):")
        for r in mismatches:
            print(f" [{r['bucket']:<8}] cov={r['coverage_score']:.2f} predicted={r['predicted_tier']} actual={r['actual_best']} | {r['name'][:50]}")

    # Coverage score distribution
    scores = [r["coverage_score"] for r in results]
    print(f"\nCoverage score distribution:")
    if scores:
        # Guard: mean, min and max are undefined when no sources were scored
        print(f" Mean: {sum(scores)/len(scores):.2f}")
        print(f" Min: {min(scores):.2f}")
        print(f" Max: {max(scores):.2f}")
    high = sum(1 for s in scores if s >= HIGH_THRESHOLD)
    mid = sum(1 for s in scores if LOW_THRESHOLD <= s < HIGH_THRESHOLD)
    low = sum(1 for s in scores if s < LOW_THRESHOLD)
    print(f" Tier distribution: baseline={high} standard={mid} taxfree={low}")
|
||||
|
||||
    # Save
    output = {
        "validation_rate": rate,
        "correct": correct,
        "total": total,
        "thresholds": {"high": HIGH_THRESHOLD, "low": LOW_THRESHOLD},
        "results": results,
    }
    RESULTS_PATH.write_text(json.dumps(output, indent=2))
    print(f"\nSaved to {RESULTS_PATH}")


if __name__ == "__main__":
    run()
|
||||
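Because the coverage score in the E1.9 script is a result count capped at three, it can only take four values, and the two thresholds split those four values into tiers. The sketch below reproduces that mapping without calling Graphiti; `coverage_from_count`, `tier_table` and `SEARCH_LIMIT` are names introduced here for illustration, while the threshold values and tier labels are taken from the script.

```python
HIGH_THRESHOLD = 0.70
LOW_THRESHOLD = 0.40
SEARCH_LIMIT = 3  # assumed to match the "limit" parameter sent to /search


def coverage_from_count(n_results, limit=SEARCH_LIMIT):
    """Coverage score as the fraction of the search limit that was filled."""
    return min(n_results / float(limit), 1.0)


def assign_tier(coverage_score):
    """Same routing rule as the script: high -> baseline, mid -> standard, low -> taxfree."""
    if coverage_score >= HIGH_THRESHOLD:
        return "baseline"
    elif coverage_score >= LOW_THRESHOLD:
        return "standard"
    return "taxfree"


def tier_table(limit=SEARCH_LIMIT):
    """Map every possible result count to the tier it would be routed to."""
    return {n: assign_tier(coverage_from_count(n, limit)) for n in range(limit + 1)}


if __name__ == "__main__":
    for n, tier in tier_table().items():
        print(f"{n} result(s): coverage={coverage_from_count(n):.2f} -> {tier}")
```

Under these thresholds, zero or one result routes to taxonomy-free, two results to standard, and three results to baseline, so moving a threshold only changes routing when it crosses 1/3 or 2/3.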
@@ -0,0 +1,134 @@
|
||||
#!/usr/bin/env python3
"""E1 metrics comparison — A (Tier 1 aaron) vs B (cascade aaron_cascade_test) on the 10 sample sources."""
import json
import subprocess
from pathlib import Path

EXPERIMENTS = Path.home() / "aaronai" / "experiments"
SAMPLE_FILE = EXPERIMENTS / "cascade_reextract_sample.json"
COMPARISON_FILE = EXPERIMENTS / "cascade_reextract_comparison.json"


def query(group_id, cypher):
    result = subprocess.run(
        ["docker", "exec", "falkordb", "redis-cli", "GRAPH.QUERY", group_id, cypher],
        capture_output=True, text=True
    )
    return result.stdout
|
||||
|
||||
def parse_int_result(output):
    """Parse a single-integer result from redis-cli GRAPH.QUERY output."""
    lines = [line.strip() for line in output.split("\n") if line.strip()]
    for line in lines:
        if line.isdigit():
            return int(line)
    return 0


def parse_string_list(output):
    """Parse a list of strings from redis-cli output (skipping headers and timing)."""
    lines = [line.strip() for line in output.split("\n") if line.strip()]
    items = []
    started = False
    for line in lines:
        if line.startswith("Cached") or line.startswith("Query internal"):
            break
        if started:
            items.append(line)
        # The header is the column name; everything after is data
        # But we don't know column names a priori, so detect transition by length pattern
        if not started and len(line) < 60 and not any(c in line for c in "{}[]"):
            # Likely a header row, skip first one
            started = True
    return items
|
||||
|
||||
def metrics_for_source(group_id, source_name):
    """Get metrics for one source's episode in one group_id."""
    # Escape backslashes and double quotes so the name stays a valid Cypher string literal
    safe = source_name.replace("\\", "\\\\").replace('"', '\\"')

    # Total entities connected to this episode
    q = f'MATCH (e:Episodic {{name: "{safe}"}})-[]-(n:Entity) RETURN count(distinct n) AS entities'
    entities = parse_int_result(query(group_id, q))

    # Total edges from this episode (all relationship types)
    q = f'MATCH (e:Episodic {{name: "{safe}"}})-[r]-() RETURN count(r) AS edges'
    edges = parse_int_result(query(group_id, q))

    # Distinct relationship types in edges from entities of this episode
    q = (f'MATCH (e:Episodic {{name: "{safe}"}})-[]-(n:Entity)-[r]-() '
         f'RETURN count(distinct type(r)) AS types')
    rel_types = parse_int_result(query(group_id, q))

    return {"entities": entities, "edges": edges, "rel_types": rel_types}
|
||||
|
||||
def main():
    with open(SAMPLE_FILE) as f:
        sample = json.load(f)
    selected = sample["selected"]

    print(f"E1 metrics comparison — {len(selected)} sources, A=aaron vs B=aaron_cascade_test\n")
    print(f"{'Source':<60} {'A.ent':>6} {'B.ent':>6} {'A.edg':>6} {'B.edg':>6} {'A.typ':>6} {'B.typ':>6}")
    print("-" * 110)

    results = []
    for ep in selected:
        name = ep["name"]
        bucket = ep["bucket"]
        a = metrics_for_source("aaron", name)
        b = metrics_for_source("aaron_cascade_test", name)
        record = {
            "name": name, "bucket": bucket,
            "a_entities": a["entities"], "b_entities": b["entities"],
            "a_edges": a["edges"], "b_edges": b["edges"],
            "a_rel_types": a["rel_types"], "b_rel_types": b["rel_types"],
        }
        results.append(record)
        # Truncate name for display
        display_name = name if len(name) <= 58 else name[:55] + "..."
        print(f"{display_name:<60} {a['entities']:>6} {b['entities']:>6} {a['edges']:>6} {b['edges']:>6} {a['rel_types']:>6} {b['rel_types']:>6}")
|
||||
|
||||
    # Aggregates
    print("\n" + "=" * 110)
    n = len(results)
    a_ent_sum = sum(r["a_entities"] for r in results)
    b_ent_sum = sum(r["b_entities"] for r in results)
    a_edge_sum = sum(r["a_edges"] for r in results)
    b_edge_sum = sum(r["b_edges"] for r in results)
    a_types_sum = sum(r["a_rel_types"] for r in results)
    b_types_sum = sum(r["b_rel_types"] for r in results)
    # Percent deltas are relative to A; report "n/a" when A is zero to avoid dividing by zero
    ent_delta = f"{(b_ent_sum-a_ent_sum)/a_ent_sum*100:+.1f}%" if a_ent_sum else "n/a"
    edge_delta = f"{(b_edge_sum-a_edge_sum)/a_edge_sum*100:+.1f}%" if a_edge_sum else "n/a"
    types_delta = f"{(b_types_sum-a_types_sum)/a_types_sum*100:+.1f}%" if a_types_sum else "n/a"
    print(f"\nAggregate (n={n}):")
    print(f" Entities: A mean={a_ent_sum/n:.1f} B mean={b_ent_sum/n:.1f} delta={ent_delta}")
    print(f" Edges: A mean={a_edge_sum/n:.1f} B mean={b_edge_sum/n:.1f} delta={edge_delta}")
    print(f" Rel types: A mean={a_types_sum/n:.1f} B mean={b_types_sum/n:.1f} delta={types_delta}")

    # Global predicate diversity check (unique types in each group_id)
    print(f"\nGlobal predicate diversity:")
    a_global = parse_int_result(query("aaron", "MATCH ()-[r]-() RETURN count(distinct type(r)) AS t"))
    b_global = parse_int_result(query("aaron_cascade_test", "MATCH ()-[r]-() RETURN count(distinct type(r)) AS t"))
    print(f" A (aaron): {a_global} distinct relationship types across whole graph")
    print(f" B (aaron_cascade_test): {b_global} distinct relationship types across whole graph")
|
||||
|
||||
    # Per-bucket
    print(f"\nPer-bucket aggregates:")
    for bucket in ["high", "mid", "low", "document"]:
        bucket_results = [r for r in results if r["bucket"] == bucket]
        if not bucket_results:
            continue
        bn = len(bucket_results)
        a_e = sum(r["a_entities"] for r in bucket_results) / bn
        b_e = sum(r["b_entities"] for r in bucket_results) / bn
        a_ed = sum(r["a_edges"] for r in bucket_results) / bn
        b_ed = sum(r["b_edges"] for r in bucket_results) / bn
        # Report "n/a" instead of dividing by a zero A-side mean
        ent_delta = f"{(b_e-a_e)/a_e*100:+.0f}%" if a_e else "n/a"
        edge_delta = f"{(b_ed-a_ed)/a_ed*100:+.0f}%" if a_ed else "n/a"
        print(f" [{bucket:>8}] n={bn} A.ent={a_e:.1f} B.ent={b_e:.1f} ({ent_delta}) "
              f"A.edg={a_ed:.1f} B.edg={b_ed:.1f} ({edge_delta})")
|
||||
|
||||
    with open(COMPARISON_FILE, "w") as f:
        json.dump({
            "results": results,
            "aggregate": {
                "a_entities_total": a_ent_sum, "b_entities_total": b_ent_sum,
                "a_edges_total": a_edge_sum, "b_edges_total": b_edge_sum,
                "global_predicate_diversity": {"a": a_global, "b": b_global},
            },
        }, f, indent=2)
    print(f"\nSaved to {COMPARISON_FILE}")


if __name__ == "__main__":
    main()
|
||||
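The percentage deltas in the comparison script divide by the A-side value, which is zero whenever a source produced no entities or edges in the `aaron` graph. One way to keep the report running in that case is a small helper that returns a sentinel instead of raising; `pct_delta` and `format_delta` are hypothetical names, and reporting `"n/a"` is an assumed convention, not something the script defines.

```python
def pct_delta(a, b):
    """Percent change from a to b, or None when the baseline a is zero."""
    if a == 0:
        return None
    return (b - a) / a * 100.0


def format_delta(a, b, digits=1):
    """Render the delta with an explicit sign, or 'n/a' when it is undefined."""
    delta = pct_delta(a, b)
    return "n/a" if delta is None else f"{delta:+.{digits}f}%"


if __name__ == "__main__":
    # A-side and B-side totals, purely illustrative numbers.
    print(format_delta(10, 15))    # baseline present
    print(format_delta(4, 3, 0))   # whole-percent formatting, as in the per-bucket rows
    print(format_delta(0, 3))      # baseline missing
```

Keeping `None` separate from `0.0` matters when the results are saved to JSON, since a missing baseline and an unchanged count would otherwise look the same.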
@@ -0,0 +1,115 @@
|
||||
#!/usr/bin/env python3
"""E1 corrected metric — count distinct predicate names on edges originating from each episode."""
import json
import subprocess
from pathlib import Path

EXPERIMENTS = Path.home() / "aaronai" / "experiments"
SAMPLE_FILE = EXPERIMENTS / "cascade_reextract_sample.json"


def query(group_id, cypher):
    result = subprocess.run(
        ["docker", "exec", "falkordb", "redis-cli", "GRAPH.QUERY", group_id, cypher],
        capture_output=True, text=True
    )
    return result.stdout
|
||||
|
||||
def get_episode_uuid(group_id, episode_name):
    """Look up the UUID for a given episode name in a given group."""
    # Escape backslashes first, then single quotes, so the name stays a valid string literal
    safe = episode_name.replace("\\", "\\\\").replace("'", "\\'")
    cypher = f"MATCH (e:Episodic) WHERE e.name = '{safe}' RETURN e.uuid LIMIT 1"
    output = query(group_id, cypher)
    lines = [line.strip() for line in output.split("\n") if line.strip()]
    for line in lines:
        # UUID format check
        if len(line) == 36 and line.count("-") == 4:
            return line
    return None
|
||||
|
||||
def count_predicates_for_episode(group_id, uuid):
    """Count distinct predicate names on edges where this episode UUID appears in r.episodes."""
    cypher = f"MATCH ()-[r:RELATES_TO]->() WHERE '{uuid}' IN r.episodes RETURN count(distinct r.name) AS p"
    output = query(group_id, cypher)
    lines = [line.strip() for line in output.split("\n") if line.strip()]
    for line in lines:
        if line.isdigit():
            return int(line)
    return 0


def count_total_edges_for_episode(group_id, uuid):
    """Count total edges originating from this episode."""
    cypher = f"MATCH ()-[r:RELATES_TO]->() WHERE '{uuid}' IN r.episodes RETURN count(r) AS n"
    output = query(group_id, cypher)
    lines = [line.strip() for line in output.split("\n") if line.strip()]
    for line in lines:
        if line.isdigit():
            return int(line)
    return 0
|
||||
|
||||
with open(SAMPLE_FILE) as f:
    sample = json.load(f)
selected = sample["selected"]

print(f"E1 corrected per-source comparison — predicates per episode by edge origin\n")
print(f"{'Source':<60} {'A.edges':>8} {'A.preds':>8} {'B.edges':>8} {'B.preds':>8}")
print("-" * 100)

a_pred_total = 0
b_pred_total = 0
a_edge_total = 0
b_edge_total = 0
records = []

for ep in selected:
    name = ep["name"]
    a_uuid = get_episode_uuid("aaron", name)
    b_uuid = get_episode_uuid("aaron_cascade_test", name)

    a_edges = count_total_edges_for_episode("aaron", a_uuid) if a_uuid else 0
    a_preds = count_predicates_for_episode("aaron", a_uuid) if a_uuid else 0
    b_edges = count_total_edges_for_episode("aaron_cascade_test", b_uuid) if b_uuid else 0
    b_preds = count_predicates_for_episode("aaron_cascade_test", b_uuid) if b_uuid else 0

    display = name if len(name) <= 58 else name[:55] + "..."
    print(f"{display:<60} {a_edges:>8} {a_preds:>8} {b_edges:>8} {b_preds:>8}")

    records.append({
        "name": name, "bucket": ep["bucket"],
        "a_edges": a_edges, "a_preds": a_preds,
        "b_edges": b_edges, "b_preds": b_preds,
    })
    a_pred_total += a_preds
    b_pred_total += b_preds
    a_edge_total += a_edges
    b_edge_total += b_edges
|
||||
|
||||
print("-" * 100)
n = len(selected)
print(f"\nAggregate (n={n}):")
print(f" Edges: A total={a_edge_total} mean={a_edge_total/n:.1f} B total={b_edge_total} mean={b_edge_total/n:.1f}")
print(f" Predicates: A total={a_pred_total} mean={a_pred_total/n:.1f} B total={b_pred_total} mean={b_pred_total/n:.1f}")
if a_pred_total > 0:
    print(f" Predicate delta: B vs A = {(b_pred_total-a_pred_total)/a_pred_total*100:+.1f}%")
if a_edge_total > 0:
    print(f" Edge delta: B vs A = {(b_edge_total-a_edge_total)/a_edge_total*100:+.1f}%")

# Per-bucket
print(f"\nPer-bucket:")
for bucket in ["high", "mid", "low", "document"]:
    bucket_records = [r for r in records if r["bucket"] == bucket]
    if not bucket_records:
        continue
    bn = len(bucket_records)
    a_p = sum(r["a_preds"] for r in bucket_records)
    b_p = sum(r["b_preds"] for r in bucket_records)
    a_e = sum(r["a_edges"] for r in bucket_records)
    b_e = sum(r["b_edges"] for r in bucket_records)
    delta = ((b_p-a_p)/a_p*100) if a_p > 0 else 0
    print(f" [{bucket:>8}] n={bn} A.preds={a_p:>3} B.preds={b_p:>3} ({delta:+.0f}%) A.edges={a_e:>3} B.edges={b_e:>3}")

with open(EXPERIMENTS / "cascade_reextract_corrected_comparison.json", "w") as f:
    json.dump({"per_source": records,
               "aggregate": {"a_preds": a_pred_total, "b_preds": b_pred_total,
                             "a_edges": a_edge_total, "b_edges": b_edge_total}}, f, indent=2)
print(f"\nSaved to {EXPERIMENTS / 'cascade_reextract_corrected_comparison.json'}")
|
||||
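The episode lookups in these comparison scripts splice source names directly into Cypher text, so a name containing a quote or a backslash can break the query. A minimal sketch of a quoting helper follows, assuming the graph engine accepts backslash escapes inside single-quoted string literals as openCypher describes; `cypher_quote` and `episode_uuid_query` are illustrative names, not functions from the scripts.

```python
def cypher_quote(value):
    """Return value as a single-quoted Cypher string literal.

    Backslashes are doubled first so that the backslash added in front of
    each single quote is not itself escaped again.
    """
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def episode_uuid_query(episode_name):
    """Build the UUID lookup query text for one episode name."""
    return f"MATCH (e:Episodic) WHERE e.name = {cypher_quote(episode_name)} RETURN e.uuid LIMIT 1"


if __name__ == "__main__":
    # Hypothetical source name containing a single quote.
    print(episode_uuid_query("Claude: O'Brien interview notes"))
```

If the deployed engine supports query parameters, passing the name as a parameter would avoid string escaping altogether; that option is not shown here because the scripts call `redis-cli` with a single query string.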
@@ -0,0 +1,190 @@
|
||||
#!/usr/bin/env python3
"""E1 orchestration — fetch source text, run Mistral metadata, submit to Graphiti test group_id."""
import json
import os
import requests
import subprocess
import time
from pathlib import Path
import psycopg2
from dotenv import load_dotenv

load_dotenv(Path.home() / "aaronai" / ".env")

EXPERIMENTS = Path.home() / "aaronai" / "experiments"
SAMPLE_FILE = EXPERIMENTS / "cascade_reextract_sample.json"
RESULTS_FILE = EXPERIMENTS / "cascade_reextract_results.json"
PG_DSN = os.environ["PG_DSN"]
SIDECAR_URL = "http://localhost:8001"
TEST_GROUP_ID = "aaron_cascade_test"
MAX_DOC_CHARS = 12000  # Same cap as Tier 1 for parity
|
||||
|
||||
# Stage 2 metadata prompt, verbatim from stage-2-worker-spec.md
METADATA_PROMPT = """You are a metadata extraction system. Given a document, produce structural and content metadata in strict JSON format.

Do not summarize the content beyond the one-sentence summary field. Do not extract entities or relationships. Do not interpret meaning. Produce only the metadata schema below.

Output JSON only. No prose, no explanation, no markdown code fences.

Schema:
{
  "language": "<ISO 639-1 code>",
  "char_length": <integer>,
  "primary_format": "<prose|slides|code|structured|mixed>",
  "structural_signals": {
    "has_headings": <boolean>,
    "has_bullet_lists": <boolean>,
    "has_numbered_lists": <boolean>,
    "has_tables": <boolean>,
    "has_code_blocks": <boolean>,
    "has_dates": <boolean>
  },
  "content_signals": {
    "has_named_people": <boolean>,
    "has_institutional_language": <boolean>,
    "has_technical_terminology": <boolean>,
    "has_first_person": <boolean>,
    "has_quotations": <boolean>
  },
  "domain_class": "<technical|administrative|educational|personal|conversational>",
  "one_sentence_summary": "<one sentence describing what the document is about>"
}

Document:
"""


def get_pg():
    return psycopg2.connect(PG_DSN)


def fetch_source_text(source):
    """Reassemble the full document from pgvector chunks, mirroring tier1_migration.py logic."""
    conn = get_pg()
    try:
        cur = conn.cursor()
        cur.execute("""
            SELECT STRING_AGG(document, E'\n\n' ORDER BY id) AS full_doc
            FROM embeddings WHERE source = %s
        """, (source,))
        row = cur.fetchone()
    finally:
        # Close the connection even if the query raises.
        conn.close()
    if row is None or row[0] is None:
        return None
    return row[0]


def run_mistral_metadata(text):
    """Call local Mistral via Ollama for base-class metadata."""
    truncated = text[:MAX_DOC_CHARS]
    prompt = METADATA_PROMPT + truncated
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral:latest", "prompt": prompt, "stream": False, "format": "json"},
        timeout=180,
    )
    response.raise_for_status()
    raw = response.json()["response"]
    try:
        metadata = json.loads(raw)
        # Override char_length with python-computed value (per stage-2-worker-spec)
        metadata["char_length"] = len(truncated)
        return metadata
    except json.JSONDecodeError:
        return {"error": "JSON parse failed", "raw": raw[:500]}


def format_metadata_as_orientation(metadata):
    """Format the base-class metadata as a source_description for Graphiti, with orient-not-bound framing."""
    if "error" in metadata:
        return f"tier1_cascade_test (metadata generation failed: {metadata['error']})"
    summary = metadata.get("one_sentence_summary", "")
    domain = metadata.get("domain_class", "unknown")
    fmt = metadata.get("primary_format", "unknown")
    return (
        f"This is a {domain} document in {fmt} format. "
        f"Summary: {summary} "
        f"This metadata is provided to orient your extraction, not to constrain it. "
        f"Extract entities and relationships freely from the document text itself; "
        f"the metadata is descriptive context, not a checklist."
    )


def submit_episode(name, content, source_description):
    """Submit episode to Graphiti sidecar at the test group_id."""
    payload = {
        "episodes": [{
            "name": name,
            "content": content[:MAX_DOC_CHARS],
            "source_description": source_description,
            "timestamp": "2026-04-28T00:00:00",
        }],
        "group_id": TEST_GROUP_ID,
    }
    response = requests.post(f"{SIDECAR_URL}/episodes/bulk", json=payload, timeout=300)
    response.raise_for_status()
    return response.json()


def main():
    with open(SAMPLE_FILE) as f:
        sample = json.load(f)
    selected = sample["selected"]
    print(f"E1 cascade re-extraction starting - {len(selected)} episodes to test group_id={TEST_GROUP_ID}\n")

    results = []
    for i, ep in enumerate(selected, 1):
        name = ep["name"]
        bucket = ep["bucket"]
        print(f"[{i}/{len(selected)}] [{bucket}] {name}")
        record = {"name": name, "bucket": bucket, "tier1_entities": ep["entities"]}

        # Fetch text
        print(f" Fetching source text...", end=" ", flush=True)
        text = fetch_source_text(name)
        if text is None:
            print("FAILED - no chunks in pgvector")
            record["error"] = "no source text"
            results.append(record)
            continue
        record["doc_chars"] = len(text)
        print(f"{len(text)} chars")

        # Mistral metadata
        print(f" Generating Mistral metadata...", end=" ", flush=True)
        t0 = time.time()
        metadata = run_mistral_metadata(text)
        elapsed = time.time() - t0
        record["metadata"] = metadata
        record["metadata_elapsed_s"] = round(elapsed, 1)
        if "error" in metadata:
            print(f"FAILED in {elapsed:.1f}s")
        else:
            print(f"{elapsed:.1f}s - domain={metadata.get('domain_class')}, format={metadata.get('primary_format')}")

        # Submit to Graphiti
        source_desc = format_metadata_as_orientation(metadata)
        record["source_description"] = source_desc
        print(f" Submitting to Graphiti test group...", end=" ", flush=True)
        t0 = time.time()
        try:
            result = submit_episode(name, text, source_desc)
            elapsed = time.time() - t0
            print(f"{elapsed:.1f}s - OK")
            record["submit_elapsed_s"] = round(elapsed, 1)
            record["submit_result"] = result
        except Exception as e:
            elapsed = time.time() - t0
            print(f"{elapsed:.1f}s - FAILED: {e}")
            record["submit_error"] = str(e)

        results.append(record)
        # Save intermediate state after each episode
        with open(RESULTS_FILE, "w") as f:
            json.dump({"results": results}, f, indent=2, default=str)
        print()

    print(f"\nDone. Results saved to {RESULTS_FILE}")


if __name__ == "__main__":
    main()
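The metadata prompt in this script restricts `primary_format` and `domain_class` to a fixed set of values, but `run_mistral_metadata` accepts whatever JSON object the model returns. The following is a minimal sketch of a post-parse check, assuming one wants to flag out-of-schema values before they are turned into orientation text. The name `validate_metadata` and its list-of-problems return shape are hypothetical and not part of the script.

```python
# Allowed values copied from the schema in METADATA_PROMPT.
ALLOWED_FORMATS = ("prose", "slides", "code", "structured", "mixed")
ALLOWED_DOMAINS = ("technical", "administrative", "educational", "personal", "conversational")


def validate_metadata(metadata):
    """Return a list of problems; an empty list means the dict looks in-schema.

    Hypothetical helper: it only checks the three fields that
    format_metadata_as_orientation actually reads.
    """
    if not isinstance(metadata, dict):
        return ["metadata is not a JSON object"]
    problems = []
    if metadata.get("primary_format") not in ALLOWED_FORMATS:
        problems.append(f"primary_format out of schema: {metadata.get('primary_format')!r}")
    if metadata.get("domain_class") not in ALLOWED_DOMAINS:
        problems.append(f"domain_class out of schema: {metadata.get('domain_class')!r}")
    summary = metadata.get("one_sentence_summary")
    if not isinstance(summary, str) or not summary.strip():
        problems.append("one_sentence_summary missing or empty")
    return problems
```

A caller could store the returned list on the per-episode record so that questionable metadata is visible in the results file without blocking submission.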
@@ -0,0 +1,181 @@
#!/usr/bin/env python3
"""E1 corrected re-run: cascade orientation passed via custom_extraction_instructions."""
import json
import os
import requests
import time
from pathlib import Path
import psycopg2
from dotenv import load_dotenv

load_dotenv(Path.home() / "aaronai" / ".env")

EXPERIMENTS = Path.home() / "aaronai" / "experiments"
SAMPLE_FILE = EXPERIMENTS / "cascade_reextract_sample.json"
RESULTS_FILE = EXPERIMENTS / "cascade_reextract_results.json"
PG_DSN = os.environ["PG_DSN"]
SIDECAR_URL = "http://localhost:8001"
TEST_GROUP_ID = "aaron_cascade_test"
MAX_DOC_CHARS = 12000

METADATA_PROMPT = """You are a metadata extraction system. Given a document, produce structural and content metadata in strict JSON format.

Do not summarize the content beyond the one-sentence summary field. Do not extract entities or relationships. Do not interpret meaning. Produce only the metadata schema below.

Output JSON only. No prose, no explanation, no markdown code fences.

Schema:
{
  "language": "<ISO 639-1 code>",
  "char_length": <integer>,
  "primary_format": "<prose|slides|code|structured|mixed>",
  "structural_signals": {
    "has_headings": <boolean>,
    "has_bullet_lists": <boolean>,
    "has_numbered_lists": <boolean>,
    "has_tables": <boolean>,
    "has_code_blocks": <boolean>,
    "has_dates": <boolean>
  },
  "content_signals": {
    "has_named_people": <boolean>,
    "has_institutional_language": <boolean>,
    "has_technical_terminology": <boolean>,
    "has_first_person": <boolean>,
    "has_quotations": <boolean>
  },
  "domain_class": "<technical|administrative|educational|personal|conversational>",
  "one_sentence_summary": "<one sentence describing what the document is about>"
}

Document:
"""


def get_pg():
    return psycopg2.connect(PG_DSN)


def fetch_source_text(source):
    conn = get_pg()
    try:
        cur = conn.cursor()
        cur.execute("""
            SELECT STRING_AGG(document, E'\n\n' ORDER BY id) AS full_doc
            FROM embeddings WHERE source = %s
        """, (source,))
        row = cur.fetchone()
    finally:
        # Close the connection even if the query raises.
        conn.close()
    if row is None or row[0] is None:
        return None
    return row[0]


def run_mistral_metadata(text):
    truncated = text[:MAX_DOC_CHARS]
    prompt = METADATA_PROMPT + truncated
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral:latest", "prompt": prompt, "stream": False, "format": "json"},
        timeout=180,
    )
    response.raise_for_status()
    raw = response.json()["response"]
    try:
        metadata = json.loads(raw)
        metadata["char_length"] = len(truncated)
        return metadata
    except json.JSONDecodeError:
        return {"error": "JSON parse failed", "raw": raw[:500]}


def format_metadata_as_orientation(metadata):
    """Format metadata as orient-not-bound extraction instructions."""
    if "error" in metadata:
        return None
    summary = metadata.get("one_sentence_summary", "")
    domain = metadata.get("domain_class", "unknown")
    fmt = metadata.get("primary_format", "unknown")
    return (
        f"This is a {domain} document in {fmt} format. "
        f"Summary: {summary} "
        f"This metadata is provided to orient your extraction, not to constrain it. "
        f"Extract entities and relationships freely from the document text itself; "
        f"the metadata is descriptive context, not a checklist."
    )


def submit_episode_singular(name, content, custom_instructions):
    """Submit episode to Graphiti's singular /episodes endpoint with cascade orientation."""
    payload = {
        "name": name,
        "content": content[:MAX_DOC_CHARS],
        "source_description": "e1_corrected_run",  # neutral label, not the cascade text
        "timestamp": "2026-04-28T00:00:00",
        "group_id": TEST_GROUP_ID,
        "custom_extraction_instructions": custom_instructions,
    }
    response = requests.post(f"{SIDECAR_URL}/episodes", json=payload, timeout=300)
    response.raise_for_status()
    return response.json()


def main():
    with open(SAMPLE_FILE) as f:
        sample = json.load(f)
    selected = sample["selected"]
    print(f"E1 CORRECTED re-run - {len(selected)} episodes via /episodes (singular)")
    print(f"Cascade orientation passed in custom_extraction_instructions.\n")

    results = []
    for i, ep in enumerate(selected, 1):
        name = ep["name"]
        bucket = ep["bucket"]
        print(f"[{i}/{len(selected)}] [{bucket}] {name}")
        record = {"name": name, "bucket": bucket, "tier1_entities": ep["entities"]}

        print(f" Fetching source text...", end=" ", flush=True)
        text = fetch_source_text(name)
        if text is None:
            print("FAILED - no chunks in pgvector")
            record["error"] = "no source text"
            results.append(record)
            continue
        record["doc_chars"] = len(text)
        print(f"{len(text)} chars")

        print(f" Generating Mistral metadata...", end=" ", flush=True)
        t0 = time.time()
        metadata = run_mistral_metadata(text)
        elapsed = time.time() - t0
        record["metadata"] = metadata
        record["metadata_elapsed_s"] = round(elapsed, 1)
        if "error" in metadata:
            print(f"FAILED in {elapsed:.1f}s")
        else:
            print(f"{elapsed:.1f}s - domain={metadata.get('domain_class')}, format={metadata.get('primary_format')}")

        custom_instructions = format_metadata_as_orientation(metadata)
        record["custom_extraction_instructions"] = custom_instructions
        print(f" Submitting via /episodes (singular) with custom_extraction_instructions...", end=" ", flush=True)
        t0 = time.time()
        try:
            result = submit_episode_singular(name, text, custom_instructions)
            elapsed = time.time() - t0
            print(f"{elapsed:.1f}s - OK")
            record["submit_elapsed_s"] = round(elapsed, 1)
            record["submit_result"] = result
        except Exception as e:
            elapsed = time.time() - t0
            print(f"{elapsed:.1f}s - FAILED: {e}")
            record["submit_error"] = str(e)

        results.append(record)
        with open(RESULTS_FILE, "w") as f:
            json.dump({"results": results}, f, indent=2, default=str)
        print()

    print(f"\nDone. Results saved to {RESULTS_FILE}")


if __name__ == "__main__":
    main()
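Both E1 runners rewrite `RESULTS_FILE` in place after every episode, so an interruption in the middle of a write could leave a truncated JSON file. The following is a minimal sketch of one common alternative, assuming the results directory is on a filesystem where `os.replace` is atomic: write to a temporary file in the same directory, then swap it into place. The helper name `save_results_atomic` is hypothetical and does not appear in the scripts.

```python
import json
import os
import tempfile
from pathlib import Path


def save_results_atomic(path, results):
    """Write {"results": results} to `path` via a temp file plus os.replace.

    Hypothetical helper: readers see either the previous complete file or the
    new complete file, never a partially written one.
    """
    path = Path(path)
    # The temp file must live in the same directory so the final rename
    # does not cross filesystems.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"results": results}, f, indent=2, default=str)
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if serialization or the rename failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

In the loops above this would stand in for the `with open(RESULTS_FILE, "w")` block, called as `save_results_atomic(RESULTS_FILE, results)`.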
@@ -0,0 +1,116 @@
#!/usr/bin/env python3
"""E1 sample selection: pick 10 episodes from Tier 1 stratified by density and type."""
import json
import os
import subprocess
from pathlib import Path
from collections import defaultdict

EXPERIMENTS = Path.home() / "aaronai" / "experiments"
OUTPUT = EXPERIMENTS / "cascade_reextract_sample.json"

# Get all Tier 1 episodes with their entity counts via FalkorDB
def query_episode_counts():
    query = ("MATCH (e:Episodic) OPTIONAL MATCH (e)-[r]-(n:Entity) "
             "RETURN e.name AS name, count(distinct n) AS entities "
             "ORDER BY entities DESC")
    result = subprocess.run(
        ["docker", "exec", "falkordb", "redis-cli", "GRAPH.QUERY", "aaron", query],
        capture_output=True, text=True
    )
    # Parse the output: redis-cli returns rows after a header
    lines = [l for l in result.stdout.split("\n") if l.strip()]
    episodes = []
    # Skip header rows ("name", "entities") and timing rows
    i = 0
    while i < len(lines):
        if lines[i] == "name":
            i += 2  # skip "name" and "entities" headers
            continue
        if lines[i].startswith("Cached") or lines[i].startswith("Query"):
            break
        # Each episode: name on one line, count on next
        if i + 1 < len(lines):
            try:
                count = int(lines[i + 1])
                episodes.append({"name": lines[i], "entities": count})
                i += 2
            except ValueError:
                i += 1
        else:
            i += 1
    return episodes

print("Fetching episode entity counts from FalkorDB...")
episodes = query_episode_counts()
print(f"Got {len(episodes)} episodes")
if not episodes:
    # Indexing into an empty counts list below would raise IndexError.
    raise SystemExit("No episodes returned from FalkorDB; cannot compute quartile boundaries.")

# Classify by density bucket and type
def is_document(name):
    doc_extensions = (".pdf", ".docx", ".pptx", ".txt", ".md")
    return any(name.lower().endswith(ext) for ext in doc_extensions)

# Compute quartile boundaries from the entity counts
counts = sorted([e["entities"] for e in episodes], reverse=True)
n = len(counts)
top_q = counts[n // 4]  # 25th percentile from top
bottom_q = counts[3 * n // 4]  # 75th percentile from top

print(f"\nQuartile boundaries: top={top_q}+, middle=({bottom_q+1}-{top_q-1}), bottom=0-{bottom_q}")

high = [e for e in episodes if e["entities"] >= top_q and not is_document(e["name"])]
mid = [e for e in episodes if bottom_q < e["entities"] < top_q and not is_document(e["name"])]
low = [e for e in episodes if e["entities"] <= bottom_q and not is_document(e["name"])]
docs = [e for e in episodes if is_document(e["name"]) and e["entities"] >= 5]

print(f"High-density conversations: {len(high)}")
print(f"Mid-density conversations: {len(mid)}")
print(f"Low-density conversations: {len(low)}")
print(f"Documents (≥5 entities): {len(docs)}")

# Deterministic selection: take from middle of each bucket to avoid edge cases
def pick(bucket, n):
    if len(bucket) < n:
        return bucket
    mid_idx = len(bucket) // 2
    start = max(0, mid_idx - n // 2)
    return bucket[start:start + n]

selected = (
    pick(high, 3) +
    pick(mid, 3) +
    pick(low, 2) +
    pick(docs, 2)
)

# Tag each with its bucket
def bucket_for(ep):
    if is_document(ep["name"]):
        return "document"
    if ep["entities"] >= top_q:
        return "high"
    if ep["entities"] > bottom_q:
        return "mid"
    return "low"

for ep in selected:
    ep["bucket"] = bucket_for(ep)

print(f"\nSelected {len(selected)} episodes for E1:")
for ep in selected:
    print(f" [{ep['bucket']:>8}] {ep['entities']:>3}e {ep['name']}")

# Save selection
with open(OUTPUT, "w") as f:
    json.dump({
        "metadata": {
            "purpose": "E1 cascade re-extraction sample (n=10)",
            "stratification": "density buckets + document subset",
            "quartile_top": top_q,
            "quartile_bottom": bottom_q,
            "total_tier1_episodes": len(episodes),
        },
        "selected": selected,
    }, f, indent=2)

print(f"\nSaved to {OUTPUT}")
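The quartile boundaries in the selection script come from plain index arithmetic on the descending-sorted counts, and the bucket tests are inclusive at the top and bottom. A small worked example may make the resulting bucket sizes easier to see. The function names `density_buckets` and `bucket_for_count` are hypothetical, and the eight counts are invented for illustration, not taken from the graph.

```python
def density_buckets(entity_counts):
    """Return (top_q, bottom_q) with the same indexing as the selection script."""
    if not entity_counts:
        raise ValueError("no entity counts to stratify")
    counts = sorted(entity_counts, reverse=True)
    n = len(counts)
    return counts[n // 4], counts[3 * n // 4]


def bucket_for_count(count, top_q, bottom_q):
    """Mirror the high / mid / low tests used for conversations."""
    if count >= top_q:
        return "high"
    if count > bottom_q:
        return "mid"
    return "low"


# Invented entity counts for eight episodes.
counts = [40, 22, 15, 9, 7, 4, 2, 0]
top_q, bottom_q = density_buckets(counts)
labels = [bucket_for_count(c, top_q, bottom_q) for c in counts]
print(top_q, bottom_q)  # 15 2
print(labels)           # ['high', 'high', 'high', 'mid', 'mid', 'mid', 'low', 'low']
```

Because the episode sitting exactly on each boundary is included, the "high" bucket here holds three of eight episodes rather than exactly a quarter; ties at a boundary value would widen it further.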
@@ -0,0 +1,24 @@
#!/usr/bin/env python3
"""E2 follow-up: confirm Aaron AI alias situation, find other potential duplicates."""
import subprocess

QUERIES = [
    ("Aaron AI variants",
     "MATCH (n:Entity) WHERE n.name CONTAINS 'Aaron AI' OR n.name CONTAINS 'ARIN' OR n.name CONTAINS 'RNAI' RETURN n.name, n.summary"),
    ("All Mossygear-named entities",
     "MATCH (n:Entity) WHERE n.name CONTAINS 'Mossy' OR n.name CONTAINS 'A+K' OR n.name CONTAINS 'AK Design' RETURN n.name, n.summary"),
    ("Total entity count check",
     "MATCH (n:Entity) RETURN count(n) as total"),
    ("Top 30 entity names by edge count",
     "MATCH (n:Entity)-[r]-() RETURN n.name, count(r) as edges ORDER BY edges DESC LIMIT 30"),
]

for label, query in QUERIES:
    print(f"\n{'=' * 60}")
    print(f"QUERY: {label}")
    print('=' * 60)
    result = subprocess.run(
        ["docker", "exec", "falkordb", "redis-cli", "GRAPH.QUERY", "aaron", query],
        capture_output=True, text=True
    )
    print(result.stdout)
@@ -0,0 +1,20 @@
#!/usr/bin/env python3
"""E2: Entity resolution diagnostic. Queries Graphiti's FalkorDB for the six test entities."""
import subprocess
import sys

TEST_ENTITIES = ["Aaron", "Kat", "HVAMC", "Bird", "Susan Hamlet", "Tulsa album"]

def run_cypher(query):
    result = subprocess.run(
        ["docker", "exec", "falkordb", "redis-cli", "GRAPH.QUERY", "aaron", query],
        capture_output=True, text=True
    )
    return result.stdout

for name in TEST_ENTITIES:
    print(f"\n{'=' * 60}")
    print(f"ENTITY: {name}")
    print('=' * 60)
    query = f"MATCH (n:Entity) WHERE n.name CONTAINS '{name}' RETURN n.name, n.summary"
    print(run_cypher(query))
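The diagnostic interpolates each entity name straight into the Cypher text. That works for the six fixed names, but a name containing an apostrophe would terminate the string literal early. The following is a minimal sketch of one way to guard against that, assuming the graph engine accepts backslash escapes inside single-quoted string literals as openCypher describes; that assumption should be checked against FalkorDB before relying on it. The helpers `cypher_string` and `contains_query` are hypothetical.

```python
def cypher_string(value):
    """Render a Python string as a single-quoted Cypher literal.

    Backslashes are escaped first so the escaping added for quotes
    is not itself doubled.
    """
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def contains_query(name):
    """Build the same CONTAINS query as the diagnostic, with the name escaped."""
    return (
        "MATCH (n:Entity) "
        f"WHERE n.name CONTAINS {cypher_string(name)} "
        "RETURN n.name, n.summary"
    )


print(contains_query("Kat"))
print(cypher_string("O'Brien"))
```

For names without quotes or backslashes the generated query text is identical to what the loop above already sends.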
@@ -0,0 +1,24 @@
#!/usr/bin/env python3
"""E2 follow-up: how many distinct episodes connect to each entity?"""
import subprocess

QUERIES = [
    ("Aaron", "MATCH (n:Entity {name: 'Aaron'})-[]-(e:Episodic) RETURN DISTINCT e.name LIMIT 30"),
    ("Nelson", "MATCH (n:Entity {name: 'Nelson'})-[]-(e:Episodic) RETURN DISTINCT e.name LIMIT 30"),
    ("HVAMC", "MATCH (n:Entity {name: 'HVAMC'})-[]-(e:Episodic) RETURN DISTINCT e.name LIMIT 30"),
    ("Bird", "MATCH (n:Entity {name: 'Bird'})-[]-(e:Episodic) RETURN DISTINCT e.name LIMIT 30"),
    ("Tulsa album", "MATCH (n:Entity {name: 'Tulsa album'})-[]-(e:Episodic) RETURN DISTINCT e.name LIMIT 30"),
    ("Susan Hamlet", "MATCH (n:Entity {name: 'Susan Hamlet'})-[]-(e:Episodic) RETURN DISTINCT e.name LIMIT 30"),
    ("Kat", "MATCH (n:Entity {name: 'Kat'})-[]-(e:Episodic) RETURN DISTINCT e.name LIMIT 30"),
    ("Katherine Wilson", "MATCH (n:Entity {name: 'Katherine Wilson'})-[]-(e:Episodic) RETURN DISTINCT e.name LIMIT 30"),
]

for label, query in QUERIES:
    print(f"\n{'=' * 60}")
    print(f"ENTITY: {label}")
    print('=' * 60)
    result = subprocess.run(
        ["docker", "exec", "falkordb", "redis-cli", "GRAPH.QUERY", "aaron", query],
        capture_output=True, text=True
    )
    print(result.stdout)
@@ -0,0 +1,304 @@
"""Backfill embeddings.type and embeddings.created_at (Improvement #2 / A.3).

Idempotent on cohort predicates (every WHERE clause includes IS NULL on the
target column). Writes provenance to metadata.type_source and metadata.created_at_source
so each row is auditable and revertable per-source. Dry-run is the default;
pass --apply to commit writes.

Order of batches:
  T1. type backfill: WHERE type IS NULL -> 'document' (extension-classified, all hit).
  C1. created_at: WHERE ca IS NULL AND metadata.filepath stat-resolves -> filesystem mtime.
  C2. created_at: WHERE ca IS NULL AND source has unique watcher_state path -> watcher mtime.
  C3. created_at: WHERE ca IS NULL AND source has watcher_state collision -> most-recent mtime.
  C4. created_at: WHERE type='chatgpt_conversation' AND ca IS NULL -> export-resolved create_time.
  C5. created_at: WHERE ca IS NULL (residual) -> sentinel.

Snapshot table embeddings_backup_2026_05_03 must exist before --apply.

Usage:
  venv/bin/python3 scripts/experiments/embeddings_backfill_apply.py           # dry-run
  venv/bin/python3 scripts/experiments/embeddings_backfill_apply.py --apply   # write

Exits non-zero on --apply if the snapshot is missing or its row count differs
from the live table.
"""
import argparse
import json
import os
import re
import sys
from collections import Counter, defaultdict
from datetime import datetime, timezone
from pathlib import Path

import psycopg2
from psycopg2.extras import RealDictCursor, Json
from dotenv import load_dotenv

load_dotenv(Path.home() / "aaronai" / ".env")

PG_DSN = os.getenv("PG_DSN")
WATCHER_STATE = Path.home() / "aaronai" / "watcher_state.json"
CHATGPT_EXPORT_DIR = Path("/home/aaron/nextcloud/data/data/aaron/files/Archive/Misc/ChatGPT Export")
SNAPSHOT_TABLE = "embeddings_backup_2026_05_03"
SENTINEL_ISO = "2026-04-26T00:00:00Z"


# ─── Helpers ────────────────────────────────────────────────────────────────

def get_pg():
    return psycopg2.connect(PG_DSN, cursor_factory=RealDictCursor)


def header(t):
    bar = "=" * 70
    print(f"\n{bar}\n{t}\n{bar}")


def fmt_ts_unix(ts):
    return datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat().replace("+00:00", "Z")


def fmt_ts_mtime(p):
    try:
        return datetime.fromtimestamp(p.stat().st_mtime, tz=timezone.utc).isoformat().replace("+00:00", "Z")
    except Exception:
        return None


def load_watcher_state():
    state = json.loads(WATCHER_STATE.read_text())
    by_name = defaultdict(list)
    for path, mtime in state.items():
        by_name[Path(path).name].append((path, mtime))
    return by_name


def load_chatgpt_index():
    if not CHATGPT_EXPORT_DIR.exists():
        return {}
    index = {}
    for f in sorted(CHATGPT_EXPORT_DIR.glob("conversations*.json")):
        try:
            data = json.loads(f.read_text(encoding="utf-8"))
        except Exception:
            continue
        for convo in data:
            cid = convo.get("id") or convo.get("conversation_id")
            ct = convo.get("create_time")
            if cid and ct is not None:
                index[cid] = ct
    return index


def assert_snapshot(cur):
    cur.execute("SELECT to_regclass(%s) AS t;", (SNAPSHOT_TABLE,))
    if cur.fetchone()["t"] is None:
        print(f"ERROR: snapshot table '{SNAPSHOT_TABLE}' not found. Run A.2 first.")
        sys.exit(2)
    cur.execute(f"SELECT COUNT(*) AS n FROM {SNAPSHOT_TABLE};")
    snap = cur.fetchone()["n"]
    cur.execute("SELECT COUNT(*) AS n FROM embeddings;")
    live = cur.fetchone()["n"]
    print(f"snapshot {SNAPSHOT_TABLE}: {snap} rows; live embeddings: {live} rows")
    if snap != live:
        print(f"ERROR: snapshot row count != live ({snap} vs {live}). Refresh snapshot before --apply.")
        sys.exit(2)


# ─── Batch primitive ────────────────────────────────────────────────────────

def run_batch(cur, label, candidates, apply_mode):
    """candidates: list of (id, set_type, set_ca, type_source, ca_source).
    set_type / set_ca may be None to leave that column alone.
    In dry-run we still execute UPDATEs inside an outer transaction (rolled back
    at the end) so subsequent batches' SELECTs see the correct intermediate state."""
    n = len(candidates)
    print(f" {label}: {n} rows queued")
    if n == 0:
        return 0
    for c in candidates[:3]:
        print(f" sample: id={c[0]} type={c[1]!r} ca={c[2]!r} type_src={c[3]} ca_src={c[4]}")
    n_written = 0
    for row_id, set_type, set_ca, type_src, ca_src in candidates:
        meta_patch = {}
        if type_src:
            meta_patch["type_source"] = type_src
        if ca_src:
            meta_patch["created_at_source"] = ca_src
        # Build set list dynamically.
        sets, params = [], []
        if set_type is not None:
            sets.append("type = %s")
            params.append(set_type)
        if set_ca is not None:
            sets.append("created_at = %s")
            params.append(set_ca)
        if meta_patch:
            sets.append("metadata = COALESCE(metadata, '{}'::jsonb) || %s::jsonb")
            params.append(json.dumps(meta_patch))
        if not sets:
            # Nothing to change for this row; an empty SET list would be invalid SQL.
            continue
        params.append(row_id)
        cur.execute(f"UPDATE embeddings SET {', '.join(sets)} WHERE id = %s;", params)
        n_written += cur.rowcount
    print(f" {n_written} rows updated{' (will rollback)' if not apply_mode else ''}")
    return n_written


# ─── Batches ────────────────────────────────────────────────────────────────

def batch_T1_type(cur, apply_mode):
    """type IS NULL -> 'document'. All cohort A rows have a SUPPORTED extension."""
    cur.execute("""
        SELECT id, source FROM embeddings WHERE type IS NULL ORDER BY id;
    """)
    rows = cur.fetchall()
    cands = [(r["id"], "document", None, "inferred_extension", None) for r in rows]
    return run_batch(cur, "T1 type IS NULL -> 'document'", cands, apply_mode)


def batch_C1_filepath_stat(cur, apply_mode):
    """ca IS NULL AND metadata.filepath stat-resolves -> mtime."""
    cur.execute("""
        SELECT id, source, metadata->>'filepath' AS fp
        FROM embeddings
        WHERE created_at IS NULL AND metadata->>'filepath' IS NOT NULL
        ORDER BY id;
    """)
    rows = cur.fetchall()
    cands, n_skipped_missing = [], 0
    for r in rows:
        p = Path(r["fp"])
        if p.exists():
            mt = fmt_ts_mtime(p)
            if mt:
                cands.append((r["id"], None, mt, None, "filepath_stat"))
                continue
        n_skipped_missing += 1
    print(f" C1 candidates: {len(cands)} (skipped {n_skipped_missing} where filepath gone or unstattable)")
    return run_batch(cur, "C1 ca IS NULL AND filepath stat-resolves -> mtime", cands, apply_mode)


def batch_C2_C3_watcher_state(cur, apply_mode):
    """ca IS NULL AND filepath unresolvable -> watcher_state by source basename.
    C2 = unique hit, C3 = collision pick-latest."""
    by_name = load_watcher_state()
    cur.execute("""
        SELECT id, source, metadata->>'filepath' AS fp
        FROM embeddings
        WHERE created_at IS NULL
        ORDER BY id;
    """)
    rows = cur.fetchall()
    c2, c3 = [], []
    skipped_no_match = 0
    for r in rows:
        # skip rows already targeted by C1 path
        if r["fp"] and Path(r["fp"]).exists():
            continue
        src = r["source"]
        if not src or src not in by_name:
            skipped_no_match += 1
            continue
        candidates = by_name[src]
        if len(candidates) == 1:
            mt = fmt_ts_unix(candidates[0][1])
            c2.append((r["id"], None, mt, None, "watcher_state_unique"))
        else:
            latest = max(candidates, key=lambda x: float(x[1]))
            mt = fmt_ts_unix(latest[1])
            c3.append((r["id"], None, mt, None, f"watcher_state_collision_pick_latest_of_{len(candidates)}"))
    print(f" C2/C3 source-basename fallback: {len(c2)} unique, {len(c3)} collision, "
          f"{skipped_no_match} unmatched (will fall to C4/C5)")
    n2 = run_batch(cur, "C2 ca IS NULL AND watcher_state unique -> mtime", c2, apply_mode)
    n3 = run_batch(cur, "C3 ca IS NULL AND watcher_state collision -> latest mtime", c3, apply_mode)
    return n2 + n3


def batch_C4_chatgpt_export(cur, apply_mode):
    index = load_chatgpt_index()
    cur.execute("""
        SELECT id, source FROM embeddings
        WHERE type='chatgpt_conversation' AND created_at IS NULL ORDER BY id;
    """)
    rows = cur.fetchall()
    cands, unresolved = [], 0
    for r in rows:
        m = re.match(r"^chatgpt_(.+)_(\d+)$", r["id"])
        cid = m.group(1) if m else None
        ct = index.get(cid)
        if ct is None:
            unresolved += 1
            continue
        ct_iso = datetime.fromtimestamp(float(ct), tz=timezone.utc).isoformat().replace("+00:00", "Z")
        cands.append((r["id"], None, ct_iso, None, "chatgpt_export"))
    print(f" C4 chatgpt export resolution: {len(cands)} resolved, {unresolved} unresolved (fall to C5)")
    return run_batch(cur, "C4 type='chatgpt_conversation' AND ca IS NULL -> export create_time", cands, apply_mode)


def batch_C5_sentinel(cur, apply_mode):
    cur.execute("""
        SELECT id, type, source FROM embeddings WHERE created_at IS NULL ORDER BY id;
    """)
    rows = cur.fetchall()
    cands = [(r["id"], None, SENTINEL_ISO, None, "sentinel") for r in rows]
    if cands:
        sample_types = Counter(r["type"] for r in rows)
        print(f" C5 residual sentinel rows by type: {dict(sample_types)}")
    return run_batch(cur, f"C5 ca IS NULL residual -> sentinel {SENTINEL_ISO}", cands, apply_mode)


# ─── Pre/post counts ────────────────────────────────────────────────────────

def print_counts(cur, label):
    cur.execute("""
        SELECT
          COUNT(*) AS total,
          COUNT(*) FILTER (WHERE type IS NULL) AS type_null,
          COUNT(*) FILTER (WHERE created_at IS NULL) AS ca_null
        FROM embeddings;
    """)
    r = cur.fetchone()
    print(f" [{label}] total={r['total']} type_null={r['type_null']} ca_null={r['ca_null']}")


# ─── Driver ─────────────────────────────────────────────────────────────────
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
ap.add_argument("--apply", action="store_true", help="default false (dry-run)")
|
||||
args = ap.parse_args()
|
||||
apply_mode = args.apply
|
||||
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
|
||||
print(f"Mode: {'APPLY (writes will commit)' if apply_mode else 'DRY-RUN (no writes)'}")
|
||||
print(f"Sentinel: {SENTINEL_ISO}")
|
||||
|
||||
if apply_mode:
|
||||
assert_snapshot(cur)
|
||||
|
||||
header("PRE-COUNTS")
|
||||
print_counts(cur, "before")
|
||||
|
||||
header("BATCHES")
|
||||
n_t1 = batch_T1_type(cur, apply_mode)
|
||||
n_c1 = batch_C1_filepath_stat(cur, apply_mode)
|
||||
n_c2c3 = batch_C2_C3_watcher_state(cur, apply_mode)
|
||||
n_c4 = batch_C4_chatgpt_export(cur, apply_mode)
|
||||
n_c5 = batch_C5_sentinel(cur, apply_mode)
|
||||
|
||||
header("POST-COUNTS")
|
||||
print_counts(cur, "after" if apply_mode else "after (in-transaction, will rollback)")
|
||||
|
||||
if apply_mode:
|
||||
pg.commit()
|
||||
print("\nCOMMITTED.")
|
||||
else:
|
||||
pg.rollback()
|
||||
print("\nROLLED BACK (dry-run).")
|
||||
|
||||
print(f"\nSummary: T1={n_t1} C1={n_c1} C2+C3={n_c2c3} C4={n_c4} C5={n_c5}")
|
||||
pg.close()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
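The driver above runs every batch inside one transaction and decides only at the end whether to commit (`--apply`) or roll back (dry-run). The following is a minimal sketch of that pattern in isolation. It assumes an in-memory `sqlite3` database as a stand-in for the real PostgreSQL connection, and the `backfill`, `count_null`, and `make_demo_db` helpers are hypothetical names used only for illustration.

```python
import sqlite3

# Assumption: same sentinel value the inspection script proposes.
SENTINEL_ISO = "2026-04-26T00:00:00Z"


def make_demo_db():
    """Build a throwaway table with two NULL created_at rows and one populated row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE embeddings (id TEXT PRIMARY KEY, created_at TEXT)")
    conn.executemany(
        "INSERT INTO embeddings VALUES (?, ?)",
        [("a", None), ("b", None), ("c", "2026-04-30T12:00:00Z")],
    )
    conn.commit()
    return conn


def count_null(conn):
    """Count rows that still lack a created_at value."""
    cur = conn.execute("SELECT COUNT(*) FROM embeddings WHERE created_at IS NULL")
    return cur.fetchone()[0]


def backfill(conn, apply_mode):
    """Write the sentinel into NULL rows; keep the write only when apply_mode is true."""
    cur = conn.cursor()
    cur.execute(
        "UPDATE embeddings SET created_at = ? WHERE created_at IS NULL",
        (SENTINEL_ISO,),
    )
    changed = cur.rowcount  # rows touched inside the open transaction
    if apply_mode:
        conn.commit()
    else:
        conn.rollback()  # dry-run: report the count, discard the write
    return changed
```

A dry-run reports the same row count as a real run, which is what makes the post-counts comparable between the two modes; only the final commit or rollback differs.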
@@ -0,0 +1,557 @@
|
||||
"""Read-only inspection for the embeddings.type / embeddings.created_at backfill (Improvement #2 / A.1).
|
||||
|
||||
Produces a survey of every backfill source-of-truth question without writing
|
||||
to the database. Output is a human-readable report on stdout plus a JSON
|
||||
sidecar at experiments/embeddings_backfill_inspection_<date>.json.
|
||||
|
||||
Sections:
|
||||
1. Cohort recap (counts; should match prior investigation).
|
||||
2. Cohort A type inference: extension classifier coverage.
|
||||
3. created_at inference for cohort A + B-doc-old:
|
||||
- rows with metadata.filepath: stat the file, check existence.
|
||||
- rows without filepath: lookup source against watcher_state.json.
|
||||
- filename-collision shape audit (live+backup, live+archive, ambiguous).
|
||||
4. ChatGPT export resolution (Plan A.1 addition #1):
|
||||
- existence of /home/aaron/nextcloud/.../ChatGPT Export/.
|
||||
- sample 5 B-chatgpt rows; resolve convo_id -> create_time.
|
||||
5. Sentinel date discovery (Plan A.1 addition #3):
|
||||
- earliest non-NULL created_at per type (already-populated rows are the
|
||||
lower bound for when the substrate started carrying timestamps).
|
||||
- git log for the pgvector migration commit.
|
||||
- any ChromaDB sqlite still on disk.
|
||||
- propose a sentinel with reasoning, or flag as arbitrary.
|
||||
6. 50-row stratified sample: derived (type, created_at, source) per row.
|
||||
|
||||
Usage: venv/bin/python3 scripts/experiments/embeddings_backfill_inspection.py
|
||||
|
||||
Read-only. No DB writes. No filesystem writes outside experiments/.
|
||||
"""
|
||||
import json
|
||||
import os
|
||||
import random
|
||||
import re
|
||||
import subprocess
|
||||
import sys
|
||||
from collections import Counter, defaultdict
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
import psycopg2
|
||||
from psycopg2.extras import RealDictCursor
|
||||
from dotenv import load_dotenv
|
||||
|
||||
load_dotenv(Path.home() / "aaronai" / ".env")
|
||||
|
||||
PG_DSN = os.getenv("PG_DSN")
|
||||
WATCHER_STATE = Path.home() / "aaronai" / "watcher_state.json"
|
||||
CHATGPT_EXPORT_DIR = Path("/home/aaron/nextcloud/data/data/aaron/files/Archive/Misc/ChatGPT Export")
|
||||
NEXTCLOUD_ROOT = Path("/home/aaron/nextcloud/data/data/aaron/files")
|
||||
OUT_PATH = Path.home() / "aaronai" / "experiments" / f"embeddings_backfill_inspection_{datetime.now().strftime('%Y-%m-%d')}.json"
|
||||
|
||||
SUPPORTED_EXT = {".pdf", ".docx", ".pptx", ".txt", ".md"}
|
||||
|
||||
random.seed(20260503)
|
||||
|
||||
|
||||
# ─── Helpers ────────────────────────────────────────────────────────────────
|
||||
|
||||
def get_pg():
|
||||
return psycopg2.connect(PG_DSN, cursor_factory=RealDictCursor)
|
||||
|
||||
|
||||
def header(title):
|
||||
bar = "=" * 70
|
||||
print(f"\n{bar}\n{title}\n{bar}")
|
||||
|
||||
|
||||
def sub(title):
|
||||
print(f"\n--- {title} ---")
|
||||
|
||||
|
||||
def fmt_ts_from_unix(ts):
|
||||
"""Watcher state stores unix timestamps as strings."""
|
||||
try:
|
||||
return datetime.fromtimestamp(float(ts), tz=timezone.utc).isoformat().replace("+00:00", "Z")
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
|
||||
def fmt_ts_from_st_mtime(p):
|
||||
try:
|
||||
return datetime.fromtimestamp(p.stat().st_mtime, tz=timezone.utc).isoformat().replace("+00:00", "Z")
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
|
||||
def load_watcher_state():
|
||||
"""Returns (path -> mtime_str), and (basename -> [(path, mtime_str), ...])."""
|
||||
state = json.loads(WATCHER_STATE.read_text())
|
||||
by_path = state
|
||||
by_name = defaultdict(list)
|
||||
for path, mtime in state.items():
|
||||
by_name[Path(path).name].append((path, mtime))
|
||||
return by_path, by_name
|
||||
|
||||
|
||||
def classify_collision_shape(paths):
    """Categorize a filename-collision group.

    Returned labels:
      - 'live+backup'           : exactly one live path; the rest carry backup/.bak markers
      - 'live+archive'          : exactly one live path; the rest are inside Archive/
      - 'live+mixed-old'        : exactly one live path, plus both backup-like and Archive/ matches
      - 'multi-live'            : >=2 paths look live (no backup/archive markers)
      - 'all-archive-or-backup' : every path is inside Archive/ or backup-like
      - 'other'                 : anything else (e.g. a single live path with no siblings)
    """
    def is_backup(p):
        s = p.lower()
        return ".bak" in s or "/backup" in s or "backups/" in s

    def is_archive(p):
        s = p.lower()
        return "/archive/" in s

    # A path can match both predicates, so it may appear in both lists.
    backups = [p for p in paths if is_backup(p)]
    archives = [p for p in paths if is_archive(p)]
    live = [p for p in paths if not is_backup(p) and not is_archive(p)]
    if len(live) == 1 and len(backups) >= 1 and len(archives) == 0:
        return "live+backup"
    if len(live) == 1 and len(archives) >= 1 and len(backups) == 0:
        return "live+archive"
    if len(live) == 1 and (len(backups) + len(archives)) >= 1:
        return "live+mixed-old"
    if len(live) >= 2:
        return "multi-live"
    if len(live) == 0:
        return "all-archive-or-backup"
    return "other"
|
||||
|
||||
|
||||
# ─── Section 1: Cohort recap ────────────────────────────────────────────────
|
||||
|
||||
def section_1_cohort_recap(cur):
|
||||
header("1. COHORT RECAP")
|
||||
cur.execute("""
|
||||
SELECT
|
||||
COUNT(*) AS total,
|
||||
COUNT(*) FILTER (WHERE type IS NULL) AS type_null,
|
||||
COUNT(*) FILTER (WHERE created_at IS NULL) AS ca_null,
|
||||
COUNT(*) FILTER (WHERE type IS NULL AND created_at IS NULL) AS both_null,
|
||||
COUNT(*) FILTER (WHERE type IS NOT NULL AND created_at IS NOT NULL) AS both_set
|
||||
FROM embeddings;
|
||||
""")
|
||||
overall = cur.fetchone()
|
||||
print(f"Total: {overall['total']} type_null: {overall['type_null']} "
|
||||
f"ca_null: {overall['ca_null']} both_null: {overall['both_null']} "
|
||||
f"both_set: {overall['both_set']}")
|
||||
|
||||
cur.execute("""
|
||||
SELECT type, created_at IS NULL AS ca_null, COUNT(*) AS n
|
||||
FROM embeddings GROUP BY type, ca_null ORDER BY type NULLS LAST, ca_null;
|
||||
""")
|
||||
cohorts = cur.fetchall()
|
||||
sub("Per-(type, ca_null) cohorts")
|
||||
for r in cohorts:
|
||||
print(f" type={r['type'] or 'NULL':<22} ca_null={r['ca_null']!s:<5} n={r['n']}")
|
||||
return {"overall": overall, "cohorts": cohorts}
|
||||
|
||||
|
||||
# ─── Section 2: Cohort A type inference ─────────────────────────────────────
|
||||
|
||||
def section_2_type_inference(cur):
|
||||
header("2. COHORT A TYPE INFERENCE (extension classifier)")
|
||||
cur.execute("""
|
||||
SELECT LOWER(SUBSTRING(source FROM '\\.[^.]+$')) AS ext, COUNT(*) AS rows
|
||||
FROM embeddings WHERE type IS NULL
|
||||
GROUP BY ext ORDER BY rows DESC;
|
||||
""")
|
||||
by_ext = cur.fetchall()
|
||||
classified = sum(r["rows"] for r in by_ext if r["ext"] in SUPPORTED_EXT)
|
||||
unknown = sum(r["rows"] for r in by_ext if r["ext"] not in SUPPORTED_EXT)
|
||||
print(f"NULL-type rows by extension:")
|
||||
for r in by_ext:
|
||||
flag = "OK" if r["ext"] in SUPPORTED_EXT else "??"
|
||||
print(f" {flag} {r['ext'] or '(none)':<8} rows={r['rows']}")
|
||||
print(f"\nClassified as 'document' via extension: {classified}")
|
||||
print(f"Unclassifiable (no SUPPORTED extension): {unknown}")
|
||||
return {"by_ext": by_ext, "classified": classified, "unclassifiable": unknown}
|
||||
|
||||
|
||||
# ─── Section 3: created_at inference ────────────────────────────────────────
|
||||
|
||||
def section_3_created_at_inference(cur):
|
||||
header("3. CREATED_AT INFERENCE — file-derived rows")
|
||||
by_path, by_name = load_watcher_state()
|
||||
print(f"watcher_state.json: {len(by_path)} tracked paths, "
|
||||
f"{len(by_name)} distinct filenames, "
|
||||
f"{sum(1 for v in by_name.values() if len(v) > 1)} filename collisions")
|
||||
|
||||
# 3a. Rows with metadata.filepath: probe stat()
|
||||
sub("3a. Rows with metadata.filepath — stat probe")
|
||||
cur.execute("""
|
||||
SELECT id, source, metadata->>'filepath' AS filepath
|
||||
FROM embeddings
|
||||
WHERE created_at IS NULL AND metadata->>'filepath' IS NOT NULL;
|
||||
""")
|
||||
rows_with_fp = cur.fetchall()
|
||||
fp_exists = 0
|
||||
fp_missing = 0
|
||||
fp_outside_root = 0
|
||||
sample_resolved = []
|
||||
for r in rows_with_fp:
|
||||
p = Path(r["filepath"])
|
||||
if not str(p).startswith(str(NEXTCLOUD_ROOT)):
|
||||
fp_outside_root += 1
|
||||
if p.exists():
|
||||
fp_exists += 1
|
||||
if len(sample_resolved) < 5:
|
||||
sample_resolved.append({
|
||||
"id": r["id"], "source": r["source"],
|
||||
"filepath": str(p), "mtime": fmt_ts_from_st_mtime(p),
|
||||
})
|
||||
else:
|
||||
fp_missing += 1
|
||||
print(f" rows with metadata.filepath: {len(rows_with_fp)}")
|
||||
print(f" exists on disk: {fp_exists}")
|
||||
print(f" missing on disk: {fp_missing}")
|
||||
print(f" outside Nextcloud root: {fp_outside_root}")
|
||||
print(f" Sample of 5 resolved mtimes:")
|
||||
for s in sample_resolved:
|
||||
print(f" {s['id']:<15} {s['source'][:60]:<60} mtime={s['mtime']}")
|
||||
|
||||
# 3b. Rows without metadata.filepath: watcher_state lookup
|
||||
sub("3b. Rows without metadata.filepath — watcher_state lookup")
|
||||
cur.execute("""
|
||||
SELECT id, source FROM embeddings
|
||||
WHERE created_at IS NULL
|
||||
AND metadata->>'filepath' IS NULL
|
||||
AND (type IS NULL OR type='document');
|
||||
""")
|
||||
rows_no_fp = cur.fetchall()
|
||||
# Distinct source basenames to look up
|
||||
basenames_to_resolve = sorted({r["source"] for r in rows_no_fp if r["source"]})
|
||||
n_resolved_unique = sum(1 for n in basenames_to_resolve if len(by_name.get(n, [])) == 1)
|
||||
n_collision_unique = sum(1 for n in basenames_to_resolve if len(by_name.get(n, [])) > 1)
|
||||
n_unfound = sum(1 for n in basenames_to_resolve if n not in by_name)
|
||||
print(f" rows without filepath: {len(rows_no_fp)}")
|
||||
print(f" distinct source basenames to resolve: {len(basenames_to_resolve)}")
|
||||
print(f" unique watcher_state hit (no collision): {n_resolved_unique}")
|
||||
print(f" collision in watcher_state (>1 path): {n_collision_unique}")
|
||||
print(f" not in watcher_state at all: {n_unfound}")
|
||||
|
||||
# 3c. Collision-shape audit
|
||||
sub("3c. Collision-shape audit — all collisions in watcher_state")
|
||||
collisions = {n: [(p, m) for p, m in by_name[n]] for n in by_name if len(by_name[n]) > 1}
|
||||
shape_counts = Counter()
|
||||
rows_affected_by_shape = Counter()
|
||||
# Map from basename to count of NULL-ca rows that need it (rows_no_fp)
|
||||
rows_no_fp_by_name = Counter(r["source"] for r in rows_no_fp)
|
||||
sample_per_shape = defaultdict(list)
|
||||
for name, paths_mtimes in collisions.items():
|
||||
paths = [p for p, _ in paths_mtimes]
|
||||
shape = classify_collision_shape(paths)
|
||||
shape_counts[shape] += 1
|
||||
rows_affected_by_shape[shape] += rows_no_fp_by_name.get(name, 0)
|
||||
if len(sample_per_shape[shape]) < 3:
|
||||
entry = {
|
||||
"name": name,
|
||||
"rows_no_fp_using_this_name": rows_no_fp_by_name.get(name, 0),
|
||||
"candidates": [
|
||||
{"path": p, "mtime": fmt_ts_from_unix(m)}
|
||||
for p, m in sorted(paths_mtimes, key=lambda x: -float(x[1]))
|
||||
],
|
||||
}
|
||||
sample_per_shape[shape].append(entry)
|
||||
print(f" collisions in watcher_state: {len(collisions)}")
|
||||
print(f" shape breakdown:")
|
||||
for shape, n in shape_counts.most_common():
|
||||
print(f" {shape:<22} collisions={n:<4} rows_affected={rows_affected_by_shape[shape]}")
|
||||
print(f"\n Up-to-3 sample collisions per shape (sorted by mtime desc):")
|
||||
for shape, samples in sample_per_shape.items():
|
||||
print(f" [{shape}]")
|
||||
for s in samples:
|
||||
print(f" {s['name']} (rows_no_fp using this name: {s['rows_no_fp_using_this_name']})")
|
||||
for c in s["candidates"]:
|
||||
print(f" {c['mtime']} {c['path']}")
|
||||
|
||||
return {
|
||||
"watcher_state_paths": len(by_path),
|
||||
"watcher_state_basenames": len(by_name),
|
||||
"watcher_state_collisions": len(collisions),
|
||||
"rows_with_filepath": {
|
||||
"total": len(rows_with_fp),
|
||||
"exists": fp_exists, "missing": fp_missing,
|
||||
"outside_root": fp_outside_root,
|
||||
"sample": sample_resolved,
|
||||
},
|
||||
"rows_without_filepath": {
|
||||
"total": len(rows_no_fp),
|
||||
"distinct_basenames": len(basenames_to_resolve),
|
||||
"unique_hit": n_resolved_unique,
|
||||
"collision_hit": n_collision_unique,
|
||||
"unfound": n_unfound,
|
||||
},
|
||||
"collision_shapes": {
|
||||
"total": len(collisions),
|
||||
"shape_counts": dict(shape_counts),
|
||||
"rows_affected_by_shape": dict(rows_affected_by_shape),
|
||||
"samples": {k: v for k, v in sample_per_shape.items()},
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
# ─── Section 4: ChatGPT export resolution ───────────────────────────────────
|
||||
|
||||
def section_4_chatgpt_export(cur):
|
||||
header("4. CHATGPT EXPORT RESOLUTION (Plan addition #1)")
|
||||
print(f"Probing: {CHATGPT_EXPORT_DIR}")
|
||||
if not CHATGPT_EXPORT_DIR.exists():
|
||||
print(" NOT FOUND — plan on sentinel for entire B-chatgpt cohort.")
|
||||
return {"export_dir_exists": False, "files": []}
|
||||
files = sorted(CHATGPT_EXPORT_DIR.glob("conversations*.json"))
|
||||
print(f" found {len(files)} export file(s):")
|
||||
for f in files:
|
||||
print(f" {f.name} size={f.stat().st_size:,} mtime={fmt_ts_from_st_mtime(f)}")
|
||||
|
||||
# Build convo_id -> create_time index from all export files.
|
||||
print("\nLoading export(s) to build convo_id -> create_time index...")
|
||||
convo_index = {}
|
||||
for f in files:
|
||||
try:
|
||||
data = json.loads(f.read_text(encoding="utf-8"))
|
||||
except Exception as e:
|
||||
print(f" failed to parse {f.name}: {e}")
|
||||
continue
|
||||
for convo in data:
|
||||
cid = convo.get("id") or convo.get("conversation_id")
|
||||
ct = convo.get("create_time")
|
||||
if cid and ct is not None:
|
||||
convo_index[cid] = ct
|
||||
print(f" indexed {len(convo_index)} conversations across {len(files)} export files")
|
||||
|
||||
# Sample 5 chatgpt_conversation rows; resolve.
|
||||
cur.execute("""
|
||||
SELECT id, source FROM embeddings
|
||||
WHERE type='chatgpt_conversation' AND created_at IS NULL
|
||||
ORDER BY random() LIMIT 5;
|
||||
""")
|
||||
sample = cur.fetchall()
|
||||
sub("Sample of 5 B-chatgpt rows: convo lookup")
|
||||
resolved = 0
|
||||
sample_results = []
|
||||
for r in sample:
|
||||
# IDs look like chatgpt_<uuid>_<idx>; uuid extends until last underscore.
|
||||
m = re.match(r"^chatgpt_(.+)_(\d+)$", r["id"])
|
||||
cid = m.group(1) if m else None
|
||||
ct = convo_index.get(cid)
|
||||
ct_iso = None
|
||||
if ct is not None:
|
||||
try:
|
||||
ct_iso = datetime.fromtimestamp(float(ct), tz=timezone.utc).isoformat().replace("+00:00", "Z")
|
||||
except Exception:
|
||||
ct_iso = None
|
||||
if ct_iso:
|
||||
resolved += 1
|
||||
sample_results.append({
|
||||
"id": r["id"], "source": r["source"], "convo_id": cid,
|
||||
"create_time": ct, "create_time_iso": ct_iso,
|
||||
"resolved": ct_iso is not None,
|
||||
})
|
||||
print(f" {r['id']} cid={cid}")
|
||||
print(f" -> create_time={ct} iso={ct_iso}")
|
||||
n_sample = len(sample)  # may be fewer than 5 if the cohort is small
print(f"\nResolved {resolved}/{n_sample}. "
f"{'PROCEED with re-derive for full cohort.' if n_sample and resolved == n_sample else 'PARTIAL — plan re-derive + sentinel for unresolved.'}")
|
||||
|
||||
# Estimate full-cohort coverage by counting how many B-chatgpt convo_ids appear in the index.
|
||||
cur.execute("""
|
||||
SELECT DISTINCT regexp_replace(id, '^chatgpt_(.+)_\\d+$', '\\1') AS cid
|
||||
FROM embeddings WHERE type='chatgpt_conversation' AND created_at IS NULL;
|
||||
""")
|
||||
distinct_cids = [r["cid"] for r in cur.fetchall()]
|
||||
in_index = sum(1 for c in distinct_cids if c in convo_index)
|
||||
print(f"Full-cohort coverage estimate: {in_index} / {len(distinct_cids)} distinct convo_ids "
|
||||
f"resolvable from export.")
|
||||
return {
|
||||
"export_dir_exists": True,
|
||||
"files": [{"name": f.name, "size": f.stat().st_size, "mtime": fmt_ts_from_st_mtime(f)} for f in files],
|
||||
"convo_index_size": len(convo_index),
|
||||
"sample_results": sample_results,
|
||||
"sample_resolved": resolved,
|
||||
"full_cohort": {
|
||||
"distinct_convo_ids": len(distinct_cids),
|
||||
"resolvable_from_export": in_index,
|
||||
"unresolvable": len(distinct_cids) - in_index,
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
# ─── Section 5: Sentinel date discovery ─────────────────────────────────────
|
||||
|
||||
def section_5_sentinel(cur):
|
||||
header("5. SENTINEL DATE DISCOVERY (Plan addition #3)")
|
||||
|
||||
# 5a. Earliest non-NULL created_at per type: lower bound on substrate age.
|
||||
sub("5a. Earliest non-NULL created_at per type")
|
||||
cur.execute("""
|
||||
SELECT type, MIN(created_at) AS earliest, MAX(created_at) AS latest, COUNT(*) AS rows
|
||||
FROM embeddings WHERE created_at IS NOT NULL GROUP BY type ORDER BY type;
|
||||
""")
|
||||
rows = cur.fetchall()
|
||||
for r in rows:
|
||||
print(f" {r['type'] or 'NULL':<22} earliest={str(r['earliest']):<32} latest={r['latest']}")
|
||||
|
||||
# 5b. git log for the pgvector-migration commit.
|
||||
sub("5b. Git log — pgvector migration commits")
|
||||
git_findings = []
|
||||
try:
|
||||
out = subprocess.run(
|
||||
["git", "log", "--all", "--format=%H %ci %s",
|
||||
"--", "deprecated/migrate_to_pgvector.py", "scripts/migrate_to_pgvector.py"],
|
||||
cwd=str(Path.home() / "aaronai"), capture_output=True, text=True, timeout=10,
|
||||
)
|
||||
for line in out.stdout.strip().splitlines():
|
||||
print(f" {line}")
|
||||
git_findings.append(line)
|
||||
except Exception as e:
|
||||
print(f" git log failed: {e}")
|
||||
# Also: when did the api/ingest scripts cut over to pgvector?
|
||||
try:
|
||||
out = subprocess.run(
|
||||
["git", "log", "--all", "--format=%H %ci %s", "--grep=pgvector", "-i"],
|
||||
cwd=str(Path.home() / "aaronai"), capture_output=True, text=True, timeout=10,
|
||||
)
|
||||
print("\n Commits mentioning pgvector:")
|
||||
for line in out.stdout.strip().splitlines()[:10]:
|
||||
print(f" {line}")
|
||||
git_findings.append(line)
|
||||
except Exception as e:
|
||||
print(f" git log (pgvector grep) failed: {e}")
|
||||
|
||||
# 5c. ChromaDB sqlite still on disk?
|
||||
sub("5c. ChromaDB dump on disk?")
|
||||
candidates = []
|
||||
for root in [Path.home() / "aaronai"]:  # rglob already descends into db/, so listing it separately would duplicate hits
|
||||
if root.exists():
|
||||
for p in root.rglob("chroma*.sqlite*"):
|
||||
candidates.append({"path": str(p), "mtime": fmt_ts_from_st_mtime(p)})
|
||||
if candidates:
|
||||
for c in candidates:
|
||||
print(f" found: {c['path']} mtime={c['mtime']}")
|
||||
else:
|
||||
print(" no ChromaDB sqlite found under ~/aaronai")
|
||||
|
||||
# 5d. Propose sentinel.
|
||||
sub("5d. Sentinel proposal")
|
||||
# Earliest doc cutover: per query, document=2026-04-30. Migration commit f78b830 was
|
||||
# 2026-04-26. Most defensible sentinel for "rows that entered pgvector before NOW()
|
||||
# writes were canonical" = the migration commit date.
|
||||
proposed = "2026-04-26T00:00:00Z"
|
||||
reasoning = (
|
||||
"git f78b830 'Migrate to pgvector — remove ChromaDB from api.py, ingest scripts, "
|
||||
"dream.py' is dated 2026-04-26. The earliest type='document' row with a non-NULL "
|
||||
"created_at lands 2026-04-30 (the F11 canonical-encoding cutover). Rows with NULL "
|
||||
"created_at all predate F11 and most predate the pgvector cutover itself. "
|
||||
"2026-04-26 is the date the ChromaDB->pgvector migration script was committed, "
|
||||
"so any row currently in the embeddings table with NULL created_at must have been "
|
||||
"ingested on or after that date (when the table came into existence in current form). "
|
||||
"It is the tightest defensible lower bound on when a row without a tracked "
"timestamp could have entered pgvector, so it is the right sentinel."
|
||||
)
|
||||
print(f" Proposed sentinel: {proposed}")
|
||||
print(f" Reasoning: {reasoning}")
|
||||
|
||||
return {
|
||||
"earliest_per_type": rows,
|
||||
"git_findings": git_findings,
|
||||
"chromadb_candidates": candidates,
|
||||
"proposed_sentinel": proposed,
|
||||
"reasoning": reasoning,
|
||||
}
|
||||
|
||||
|
||||
# ─── Section 6: 50-row stratified sample ────────────────────────────────────
|
||||
|
||||
def section_6_stratified_sample(cur, sentinel_iso):
|
||||
header("6. 50-ROW STRATIFIED SAMPLE — derived (type, created_at, source)")
|
||||
by_path, by_name = load_watcher_state()
|
||||
|
||||
cohorts = [
|
||||
("A (type NULL, ca NULL)", "type IS NULL AND created_at IS NULL", 10),
|
||||
("B-doc-old (type='document', ca NULL)", "type='document' AND created_at IS NULL", 10),
|
||||
("B-chatgpt (type='chatgpt_conversation', ca NULL)", "type='chatgpt_conversation' AND created_at IS NULL", 10),
|
||||
("C-doc-new (type='document', ca set)", "type='document' AND created_at IS NOT NULL", 10),
|
||||
("C-claude (type='claude_conversation', ca set)", "type='claude_conversation' AND created_at IS NOT NULL", 5),
|
||||
("C-aaronai (type='aaronai_conversation', ca set)", "type='aaronai_conversation' AND created_at IS NOT NULL", 5),
|
||||
]
|
||||
|
||||
samples = []
|
||||
for label, predicate, n in cohorts:
|
||||
sub(f"{label} (sample size: {n})")
|
||||
cur.execute(f"""
|
||||
SELECT id, source, type, created_at, metadata
|
||||
FROM embeddings WHERE {predicate}
|
||||
ORDER BY random() LIMIT %s;
|
||||
""", (n,))
|
||||
rows = cur.fetchall()
|
||||
for r in rows:
|
||||
row_meta = r["metadata"] or {}
|
||||
fp = row_meta.get("filepath")
|
||||
inferred_type = r["type"] or ("document" if (r["source"] or "").lower().endswith(tuple(SUPPORTED_EXT)) else "?")
|
||||
inferred_ca = r["created_at"]
|
||||
inferred_ca_source = "preserved" if inferred_ca else None
|
||||
if not inferred_ca:
|
||||
if fp and Path(fp).exists():
|
||||
inferred_ca = fmt_ts_from_st_mtime(Path(fp))
|
||||
inferred_ca_source = "filepath_stat"
|
||||
elif r["source"] and r["source"] in by_name:
|
||||
candidates = by_name[r["source"]]
|
||||
if len(candidates) == 1:
|
||||
inferred_ca = fmt_ts_from_unix(candidates[0][1])
|
||||
inferred_ca_source = "watcher_state_unique"
|
||||
else:
|
||||
# take most recent
|
||||
latest = max(candidates, key=lambda x: float(x[1]))
|
||||
inferred_ca = fmt_ts_from_unix(latest[1])
|
||||
inferred_ca_source = f"watcher_state_collision_pick_latest_of_{len(candidates)}"
|
||||
else:
|
||||
inferred_ca = sentinel_iso
|
||||
inferred_ca_source = "sentinel"
|
||||
print(f" id={r['id']:<22} src={(r['source'] or '')[:38]:<38}")
|
||||
print(f" existing: type={r['type']!r:<22} ca={r['created_at']!r}")
|
||||
print(f" inferred: type={inferred_type!r:<22} ca={inferred_ca!r} ({inferred_ca_source})")
|
||||
samples.append({
|
||||
"cohort": label, "id": r["id"], "source": r["source"],
|
||||
"existing_type": r["type"], "existing_ca": r["created_at"],
|
||||
"inferred_type": inferred_type, "inferred_ca": inferred_ca,
|
||||
"inferred_ca_source": inferred_ca_source,
|
||||
})
|
||||
return samples
|
||||
|
||||
|
||||
# ─── Driver ─────────────────────────────────────────────────────────────────
|
||||
|
||||
def main():
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
|
||||
out = {"generated_at": datetime.now(timezone.utc).isoformat()}
|
||||
out["section_1"] = section_1_cohort_recap(cur)
|
||||
out["section_2"] = section_2_type_inference(cur)
|
||||
out["section_3"] = section_3_created_at_inference(cur)
|
||||
out["section_4"] = section_4_chatgpt_export(cur)
|
||||
out["section_5"] = section_5_sentinel(cur)
|
||||
sentinel_iso = out["section_5"]["proposed_sentinel"]
|
||||
out["section_6"] = section_6_stratified_sample(cur, sentinel_iso)
|
||||
|
||||
pg.close()
|
||||
|
||||
# JSON sidecar — strip non-serializables.
|
||||
def _serialize(o):
|
||||
if isinstance(o, datetime):
|
||||
return o.isoformat()
|
||||
return str(o)
|
||||
|
||||
OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
|
||||
OUT_PATH.write_text(json.dumps(out, indent=2, default=_serialize))
|
||||
print(f"\nJSON sidecar written: {OUT_PATH}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
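Section 6 of the inspection script above derives `created_at` through a fixed precedence: stat the recorded filepath, then a unique `watcher_state` hit, then the most recent of several colliding `watcher_state` entries, and finally the sentinel. The sketch below restates that chain as a pure function so it can be exercised without a database or filesystem. `resolve_created_at` and `unix_to_iso` are hypothetical names, and passing the file mtime in as an argument (instead of calling `stat()`) is an assumption made for testability.

```python
from datetime import datetime, timezone


def unix_to_iso(ts):
    """Convert a unix timestamp (number or numeric string) to a UTC ISO-8601 'Z' string."""
    return (
        datetime.fromtimestamp(float(ts), tz=timezone.utc)
        .isoformat()
        .replace("+00:00", "Z")
    )


def resolve_created_at(source, file_mtime, by_name, sentinel_iso):
    """Return (iso_timestamp, provenance) for one row.

    source       : the row's source basename
    file_mtime   : unix mtime of metadata.filepath if that file exists, else None
    by_name      : basename -> [(path, mtime_str), ...] as built from watcher_state.json
    sentinel_iso : fallback timestamp when nothing else resolves
    """
    if file_mtime is not None:
        return unix_to_iso(file_mtime), "filepath_stat"
    candidates = by_name.get(source, [])
    if len(candidates) == 1:
        return unix_to_iso(candidates[0][1]), "watcher_state_unique"
    if len(candidates) > 1:
        # Collision: several tracked paths share this basename; take the newest.
        latest = max(candidates, key=lambda c: float(c[1]))
        label = f"watcher_state_collision_pick_latest_of_{len(candidates)}"
        return unix_to_iso(latest[1]), label
    return sentinel_iso, "sentinel"
```

Returning the provenance label alongside the timestamp keeps each derived value auditable, which matters most for the collision and sentinel branches where the value is a guess.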
@@ -0,0 +1,296 @@
|
||||
"""Read-only analysis of Stage 2 frame data via stage2_frames_v.
|
||||
|
||||
Produces eight sections (frequency, hygiene, per-doc count, co-occurrence,
folder cross-tab, worker-version split, data-gap accounting, corpus coverage)
and writes a JSON sidecar for diffing across runs.
|
||||
|
||||
Usage: venv/bin/python3 scripts/experiments/frame_distribution_report.py
|
||||
"""
|
||||
import os
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
from collections import Counter, defaultdict
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
|
||||
import psycopg2
|
||||
from dotenv import load_dotenv
|
||||
|
||||
load_dotenv()
|
||||
|
||||
OUT_PATH = Path.home() / "aaronai" / "experiments" / f"frame_distribution_{datetime.now().strftime('%Y-%m-%d')}.json"
|
||||
TOP_K = 20 # for co-occurrence; revisit after seeing the long tail
|
||||
|
||||
|
||||
def normalize(label):
|
||||
return re.sub(r"\s+", " ", label.strip().lower().replace("_", " "))
|
||||
|
||||
|
||||
def folder_bin(source):
|
||||
"""Classify source by type. stage_3_queue stores bare filenames, so we
|
||||
bin by what kind of file it is, not where it lives in the tree."""
|
||||
if not source:
|
||||
return "unknown"
|
||||
if re.match(r"^(Claude|ChatGPT|Aaron AI):", source):
|
||||
return "conversation" # bypasses Stage 2/3, will not appear here
|
||||
s = source.lower()
|
||||
if re.search(r"\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-voice\.md$", s):
|
||||
return "voice_note"
|
||||
if re.search(r"\d{4}-\d{2}-\d{2}-(nrem|early-rem|late-rem|synthesis|lucid)", s):
|
||||
return "dream_output"
|
||||
if s.endswith(".md"):
|
||||
return "markdown"
|
||||
if s.endswith(".pdf"):
|
||||
return "pdf"
|
||||
if s.endswith(".docx") or s.endswith(".doc"):
|
||||
return "docx"
|
||||
if s.endswith(".pptx") or s.endswith(".ppt"):
|
||||
return "pptx"
|
||||
if s.endswith(".txt"):
|
||||
return "txt"
|
||||
return "other"
|
||||
|
||||
|
||||
def fetch_rows(cur):
|
||||
cur.execute("""
|
||||
SELECT source, char_length, active_frames, worker_version, raw_metadata
|
||||
FROM stage2_frames_v
|
||||
""")
|
||||
rows = []
|
||||
for source, char_length, frames, worker_version, raw in cur.fetchall():
|
||||
if not isinstance(frames, list):
|
||||
continue
|
||||
rows.append({
|
||||
"source": source,
|
||||
"char_length": char_length,
|
||||
"frames": [str(f) for f in frames if f],
|
||||
"worker_version": worker_version,
|
||||
"raw_keys": sorted(raw.keys()) if isinstance(raw, dict) else [],
|
||||
})
|
||||
return rows
|
||||
|
||||
|
||||
def section_frequency(rows):
|
||||
counter = Counter()
|
||||
for r in rows:
|
||||
for f in r["frames"]:
|
||||
counter[f] += 1
|
||||
return counter
|
||||
|
||||
|
||||
def section_hygiene(frequency):
|
||||
"""Group raw labels by normalized form; flag collisions."""
|
||||
groups = defaultdict(list)
|
||||
for raw, count in frequency.items():
|
||||
groups[normalize(raw)].append((raw, count))
|
||||
collisions = {k: v for k, v in groups.items() if len(v) > 1}
|
||||
return collisions
|
||||
|
||||
|
||||
def section_per_doc_count(rows):
|
||||
counts = Counter(len(r["frames"]) for r in rows)
|
||||
return counts
|
||||
|
||||
|
||||
def section_cooccurrence(rows, top_frames):
|
||||
top_set = set(top_frames)
|
||||
pair_counts = Counter()
|
||||
for r in rows:
|
||||
present = [f for f in r["frames"] if f in top_set]
|
||||
for i in range(len(present)):
|
||||
for j in range(i + 1, len(present)):
|
||||
a, b = sorted([present[i], present[j]])
|
||||
pair_counts[(a, b)] += 1
|
||||
return pair_counts
|
||||
|
||||
|
||||
def section_folder_crosstab(rows, top_frames):
|
||||
top_set = set(top_frames)
|
||||
table = defaultdict(Counter) # frame -> bin -> count
|
||||
bin_totals = Counter()
|
||||
for r in rows:
|
||||
b = folder_bin(r["source"])
|
||||
bin_totals[b] += 1
|
||||
for f in r["frames"]:
|
||||
if f in top_set:
|
||||
table[f][b] += 1
|
||||
return table, bin_totals
|
||||
|
||||
|
||||
def section_worker_versions(rows):
|
||||
counter = Counter(r["worker_version"] or "unknown" for r in rows)
|
||||
raw_keys_by_version = defaultdict(Counter)
|
||||
for r in rows:
|
||||
v = r["worker_version"] or "unknown"
|
||||
raw_keys_by_version[v][tuple(r["raw_keys"])] += 1
|
||||
return counter, raw_keys_by_version
|
||||
|
||||
|
||||
def section_data_gap(cur):
|
||||
"""Docs that completed Stage 2 but never had frames extracted (<2000 chars)."""
|
||||
cur.execute("""
|
||||
SELECT source, char_length
|
||||
FROM stage_2_queue
|
||||
WHERE completed_at IS NOT NULL AND char_length < 2000
|
||||
""")
|
||||
missing = cur.fetchall()
|
||||
by_bin = Counter(folder_bin(s) for s, _ in missing)
|
||||
char_lengths = [c for _, c in missing]
|
||||
return {
|
||||
"count": len(missing),
|
||||
"by_type_bin": dict(by_bin),
|
||||
"char_length": {
|
||||
"min": min(char_lengths) if char_lengths else None,
|
||||
"max": max(char_lengths) if char_lengths else None,
|
||||
"median": sorted(char_lengths)[len(char_lengths) // 2] if char_lengths else None,
|
||||
},
|
||||
"sample_sources": [s for s, _ in missing[:10]],
|
||||
}
|
||||
|
||||
|
||||
def section_corpus_coverage(cur):
|
||||
"""How much of the embeddings corpus has frame coverage?"""
|
||||
cur.execute("SELECT count(DISTINCT source) FROM embeddings")
|
||||
total = cur.fetchone()[0]
|
||||
cur.execute("""
|
||||
SELECT count(DISTINCT source) FROM embeddings
|
||||
WHERE source LIKE 'Claude:%' OR source LIKE 'ChatGPT:%'
|
||||
OR source LIKE 'Aaron AI:%' OR type='aaronai_conversation'
|
||||
""")
|
||||
conversations = cur.fetchone()[0]
|
||||
cur.execute("SELECT count(DISTINCT source) FROM stage_3_queue WHERE stage2_metadata IS NOT NULL")
|
||||
with_frames = cur.fetchone()[0]
|
||||
cur.execute("""
|
||||
SELECT count(DISTINCT source) FROM stage_2_queue
|
||||
WHERE completed_at IS NOT NULL AND char_length < 2000
|
||||
""")
|
||||
short_no_frames = cur.fetchone()[0]
|
||||
cur.execute("""
|
||||
SELECT count(DISTINCT source) FROM stage_2_queue
|
||||
WHERE failed_at IS NOT NULL
|
||||
""")
|
||||
failed = cur.fetchone()[0]
|
||||
return {
|
||||
"total_distinct_sources_in_embeddings": total,
|
||||
"conversations_no_frames_by_design": conversations,
|
||||
"files_with_frames": with_frames,
|
||||
"files_short_no_frames": short_no_frames,
|
||||
"files_stage2_failed": failed,
|
||||
"frame_coverage_pct": round(100.0 * with_frames / max(total, 1), 1),
|
||||
}
|
||||
|
||||
|
||||
def main():
    conn = psycopg2.connect(os.environ["PG_DSN"])
    cur = conn.cursor()

    rows = fetch_rows(cur)
    n_docs = len(rows)
    print(f"=== Stage 2 frame distribution report ({n_docs} docs) ===\n")

    # 1. Frequency
    freq = section_frequency(rows)
    print(f"--- 1. Frame frequency ({len(freq)} distinct labels) ---")
    for label, count in freq.most_common(30):
        print(f" {count:5d} {label}")
    print()

    # 2. Hygiene
    collisions = section_hygiene(freq)
    print(f"--- 2. Label hygiene (normalized collisions: {len(collisions)}) ---")
    for norm, variants in sorted(collisions.items(), key=lambda kv: -sum(c for _, c in kv[1])):
        variant_str = ", ".join(f"{r!r}:{c}" for r, c in sorted(variants, key=lambda x: -x[1]))
        print(f" '{norm}': {variant_str}")
    print()

    # 3. Per-doc frame count
    per_doc = section_per_doc_count(rows)
    print("--- 3. Per-doc frame count ---")
    for n in sorted(per_doc):
        print(f" {n} frames: {per_doc[n]} docs")
    print()

    # 4. Co-occurrence (top-K)
    top_frames = [f for f, _ in freq.most_common(TOP_K)]
    pairs = section_cooccurrence(rows, top_frames)
    print(f"--- 4. Co-occurrence (top-{TOP_K} frames, top-30 pairs) ---")
    for (a, b), count in pairs.most_common(30):
        print(f" {count:4d} {a} × {b}")
    print()

    # 5. Folder cross-tab
    crosstab, bin_totals = section_folder_crosstab(rows, top_frames)
    print(f"--- 5. Frame × folder cross-tab (top-{TOP_K} frames) ---")
    bins_sorted = [b for b, _ in bin_totals.most_common()]
    print(f" bins (with totals): " + ", ".join(f"{b}({n})" for b, n in bin_totals.most_common(10)))
    for f in top_frames:
        row_data = crosstab[f]
        if not row_data:
            continue
        cells = ", ".join(f"{b}={c}" for b, c in row_data.most_common(5))
        print(f" {f}: {cells}")
    print()

    # 6. Worker versions
    versions, keys_by_version = section_worker_versions(rows)
    print("--- 6. Worker version split ---")
    for v, count in versions.most_common():
        print(f" v{v}: {count} docs")
        top_shapes = keys_by_version[v].most_common(3)
        for keys, kcount in top_shapes:
            print(f" {kcount} docs with keys={list(keys)}")
    print()

    # 7. Data gap
    gap = section_data_gap(cur)
    print("--- 7. Data-gap accounting (Stage 2 docs <2000 chars; never frame-extracted) ---")
    print(f" count: {gap['count']}")
    print(f" char_length: min={gap['char_length']['min']}, median={gap['char_length']['median']}, max={gap['char_length']['max']}")
    print(f" by type bin: {gap['by_type_bin']}")
    print(f" sample sources: {gap['sample_sources']}")
    print()

    # 8. Corpus coverage
    coverage = section_corpus_coverage(cur)
    print("--- 8. Corpus-wide frame coverage ---")
    print(f" total distinct sources in embeddings: {coverage['total_distinct_sources_in_embeddings']}")
    print(f" conversations (no frames by design): {coverage['conversations_no_frames_by_design']}")
    print(f" files with frames: {coverage['files_with_frames']}")
    print(f" files short, no frames: {coverage['files_short_no_frames']}")
    print(f" files Stage 2 failed: {coverage['files_stage2_failed']}")
    print(f" frame coverage: {coverage['frame_coverage_pct']}% of corpus")
    print()

    # JSON sidecar
    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    sidecar = {
        "generated_at": datetime.now().isoformat(),
        "n_docs_with_frames": n_docs,
        "n_distinct_labels": len(freq),
        "top_30_frames": freq.most_common(30),
        "label_collisions": {
            k: [(r, c) for r, c in v] for k, v in collisions.items()
        },
        "per_doc_frame_count": dict(per_doc),
        "top_30_pairs": [
            {"a": a, "b": b, "count": c}
            for (a, b), c in pairs.most_common(30)
        ],
        "folder_crosstab": {
            f: dict(crosstab[f]) for f in top_frames if crosstab[f]
        },
        "bin_totals": dict(bin_totals),
        "worker_versions": dict(versions),
        "data_gap": gap,
        "corpus_coverage": coverage,
    }
    OUT_PATH.write_text(json.dumps(sidecar, indent=2, default=str))
    print(f"JSON sidecar written: {OUT_PATH}")

    cur.close()
    conn.close()


if __name__ == "__main__":
    main()
|
||||
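Section 2 of the report above lists raw labels that collapse to the same normalized form. `section_hygiene` is defined outside this excerpt, so the following is only a minimal sketch of how such a check could work. The normalization rule (lowercase, runs of non-alphanumerics collapsed to `_`) and the helper names `normalize_label` and `find_collisions` are assumptions for illustration, not the script's actual implementation.

```python
import re
from collections import Counter, defaultdict


def normalize_label(label: str) -> str:
    """Hypothetical normalization: lowercase, collapse non-alphanumerics to '_'."""
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")


def find_collisions(freq: Counter) -> dict:
    """Group raw labels by normalized form; keep groups with more than one variant."""
    groups = defaultdict(list)
    for raw, count in freq.items():
        groups[normalize_label(raw)].append((raw, count))
    return {norm: variants for norm, variants in groups.items() if len(variants) > 1}


# Made-up label counts, only to exercise the helpers.
freq = Counter({"Risk Assessment": 12, "risk_assessment": 3, "Budget": 7})
collisions = find_collisions(freq)
```

The return shape (normalized label mapped to a list of `(raw_label, count)` pairs) matches how `main()` iterates over `collisions`, which is why it was chosen here.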
@@ -0,0 +1,257 @@
|
||||
#!/usr/bin/env python3
"""
Experiment 005 — Actual API Token Measurement

Measures input token reduction from prepending v2 briefing vs raw document
on Claude Haiku, validating the 42.0% modeled estimate from Experiment 002b.

Outputs: ~/aaronai/experiments/token_measurement_results.json
"""

import json
import os
import statistics
import sys
import time
from datetime import datetime, timezone
from pathlib import Path

import anthropic
import psycopg2
from dotenv import load_dotenv

load_dotenv(Path.home() / "aaronai" / ".env")

INPUT_FILE = Path.home() / "aaronai" / "briefing_test_v2_results.json"
OUTPUT_FILE = Path.home() / "aaronai" / "experiments" / "token_measurement_results.json"
MODEL = "claude-haiku-4-5-20251001"
MAX_TOKENS = 1024

EXTRACTION_PROMPT = (
    "Extract entities and their relationships from the document below. "
    "Return ONLY valid JSON with this schema:\n"
    "{\n"
    ' "people": [string],\n'
    ' "organizations": [string],\n'
    ' "locations": [string],\n'
    ' "dates": [string],\n'
    ' "relationships": [{"subject": string, "predicate": string, "object": string}]\n'
    "}\n"
    "No prose, no markdown fences, no commentary. JSON only."
)
|
||||
|
||||
|
||||
def fetch_document_text(pg_conn, source):
    """Reconstruct the document by concatenating its chunks from pgvector."""
    cur = pg_conn.cursor()
    cur.execute(
        "SELECT document FROM embeddings WHERE source = %s ORDER BY id",
        (source,),
    )
    rows = cur.fetchall()
    cur.close()
    if not rows:
        return None
    return "\n\n".join(r[0] for r in rows)


def build_raw_message(document_text):
    return f"{EXTRACTION_PROMPT}\n\nDOCUMENT:\n{document_text}"


def build_briefed_message(briefing, document_text):
    briefing_str = json.dumps(briefing, indent=2)
    return (
        f"{EXTRACTION_PROMPT}\n\n"
        f"BRIEFING (pre-analysis from local model — use to orient):\n{briefing_str}\n\n"
        f"DOCUMENT:\n{document_text}"
    )


def call_haiku(client, message_text):
    t0 = time.time()
    resp = client.messages.create(
        model=MODEL,
        max_tokens=MAX_TOKENS,
        messages=[{"role": "user", "content": message_text}],
    )
    return {
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
        "latency_s": round(time.time() - t0, 2),
        "response_text": resp.content[0].text if resp.content else "",
        "stop_reason": resp.stop_reason,
    }


def ci_95(values):
    if len(values) < 2:
        return (statistics.mean(values) if values else 0.0, 0.0)
    mean = statistics.mean(values)
    half = 1.96 * statistics.stdev(values) / (len(values) ** 0.5)
    return (mean, half)
|
||||
|
||||
|
||||
def main():
    if not INPUT_FILE.exists():
        print(f"ERROR: {INPUT_FILE} not found", file=sys.stderr)
        sys.exit(1)

    api_key = os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        print("ERROR: ANTHROPIC_API_KEY not set", file=sys.stderr)
        sys.exit(1)

    pg_dsn = os.environ.get("PG_DSN")
    if not pg_dsn:
        print("ERROR: PG_DSN not set", file=sys.stderr)
        sys.exit(1)

    client = anthropic.Anthropic(api_key=api_key)
    pg_conn = psycopg2.connect(pg_dsn)

    with open(INPUT_FILE) as f:
        v2_data = json.load(f)

    docs_meta = [
        d for d in v2_data["documents"]
        if d.get("status") == "SUCCESS"
        and d.get("briefing")
    ]

    print(f"Loaded {len(docs_meta)} successful briefings from {INPUT_FILE.name}")
    print(f"Model: {MODEL}")
    print(f"Calls planned: up to {len(docs_meta) * 2}\n")

    results = []
    started_at = datetime.now(timezone.utc).isoformat()
    t_total = time.time()

    for i, doc in enumerate(docs_meta, 1):
        source = doc["source"]
        briefing = doc["briefing"]

        document_text = fetch_document_text(pg_conn, source)
        if not document_text:
            print(f"[{i:02d}/{len(docs_meta)}] {source[:60]} -- SKIP (not in pgvector)")
            results.append({"source": source, "skipped": "not_in_pgvector"})
            continue

        print(f"[{i:02d}/{len(docs_meta)}] {source[:60]}")

        try:
            raw_result = call_haiku(client, build_raw_message(document_text))
        except Exception as e:
            print(f" RAW FAILED: {e}")
            raw_result = {"error": str(e)}

        try:
            briefed_result = call_haiku(client, build_briefed_message(briefing, document_text))
        except Exception as e:
            print(f" BRIEFED FAILED: {e}")
            briefed_result = {"error": str(e)}

        delta = None
        if "input_tokens" in raw_result and "input_tokens" in briefed_result:
            raw_in = raw_result["input_tokens"]
            briefed_in = briefed_result["input_tokens"]
            raw_out = raw_result["output_tokens"]
            briefed_out = briefed_result["output_tokens"]
            input_red = (raw_in - briefed_in) / raw_in * 100 if raw_in else 0.0
            output_delta = (briefed_out - raw_out) / raw_out * 100 if raw_out else 0.0
            delta = {
                "input_reduction_pct": round(input_red, 2),
                "output_delta_pct": round(output_delta, 2),
                "raw_input_tokens": raw_in,
                "briefed_input_tokens": briefed_in,
                "raw_output_tokens": raw_out,
                "briefed_output_tokens": briefed_out,
            }
            print(
                f" in: {raw_in} -> {briefed_in} ({input_red:+.1f}%) | "
                f"out: {raw_out} -> {briefed_out}"
            )

        results.append({
            "source": source,
            "raw": raw_result,
            "briefed": briefed_result,
            "delta": delta,
        })

    pg_conn.close()
    total_elapsed = round(time.time() - t_total, 1)

    valid = [r for r in results if r.get("delta") is not None]
    skipped = [r for r in results if r.get("skipped")]
    reductions = [r["delta"]["input_reduction_pct"] for r in valid]
    output_deltas = [r["delta"]["output_delta_pct"] for r in valid]
    raw_in_total = sum(r["delta"]["raw_input_tokens"] for r in valid)
    briefed_in_total = sum(r["delta"]["briefed_input_tokens"] for r in valid)
    raw_out_total = sum(r["delta"]["raw_output_tokens"] for r in valid)
    briefed_out_total = sum(r["delta"]["briefed_output_tokens"] for r in valid)

    HAIKU_IN = 1.0
    HAIKU_OUT = 5.0
    raw_cost = (raw_in_total * HAIKU_IN + raw_out_total * HAIKU_OUT) / 1_000_000
    briefed_cost = (briefed_in_total * HAIKU_IN + briefed_out_total * HAIKU_OUT) / 1_000_000

    mean_red, ci_half = ci_95(reductions)
    mean_out_delta, _ = ci_95(output_deltas)

    summary = {
        "experiment": "005",
        "title": "Actual API Token Measurement",
        "started_at": started_at,
        "completed_at": datetime.now(timezone.utc).isoformat(),
        "model": MODEL,
        "extraction_prompt": EXTRACTION_PROMPT,
        "n_documents_attempted": len(docs_meta),
        "n_skipped_not_in_pgvector": len(skipped),
        "n_valid_pairs": len(valid),
        "n_failed": len(docs_meta) - len(valid) - len(skipped),
        "total_elapsed_s": total_elapsed,
        "input_token_reduction": {
            "mean_pct": round(mean_red, 2),
            "ci_95_half_width_pct": round(ci_half, 2),
            "median_pct": round(statistics.median(reductions), 2) if reductions else None,
            "min_pct": round(min(reductions), 2) if reductions else None,
            "max_pct": round(max(reductions), 2) if reductions else None,
            "stdev_pct": round(statistics.stdev(reductions), 2) if len(reductions) > 1 else 0.0,
        },
        "output_token_delta": {"mean_pct": round(mean_out_delta, 2)},
        "totals": {
            "raw_input_tokens": raw_in_total,
            "briefed_input_tokens": briefed_in_total,
            "raw_output_tokens": raw_out_total,
            "briefed_output_tokens": briefed_out_total,
            "raw_cost_usd": round(raw_cost, 4),
            "briefed_cost_usd": round(briefed_cost, 4),
            "savings_usd": round(raw_cost - briefed_cost, 4),
        },
        "comparison_to_v2_estimate": {
            "v2_modeled_reduction_pct": 42.0,
            "measured_mean_reduction_pct": round(mean_red, 2),
            "delta_pct_points": round(mean_red - 42.0, 2),
        },
        "results": results,
    }

    OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    with open(OUTPUT_FILE, "w") as f:
        json.dump(summary, f, indent=2)

    print()
    print("=" * 60)
    print(f"DONE — {len(valid)}/{len(docs_meta)} valid pairs in {total_elapsed}s")
    if skipped:
        print(f"Skipped (not in pgvector): {len(skipped)}")
    print(f"Mean input token reduction: {mean_red:.2f}% +/- {ci_half:.2f}% (95% CI)")
    print(f"V2 modeled estimate: 42.0% | delta: {mean_red - 42.0:+.2f} pts")
    print(f"Mean output token delta: {mean_out_delta:+.2f}%")
    print(f"Total cost: ${raw_cost + briefed_cost:.4f}")
    print(f"Results: {OUTPUT_FILE}")


if __name__ == "__main__":
    main()
|
||||
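`ci_95` in the script above uses the normal approximation: half-width = 1.96 · s / √n. With only a handful of documents that interval is somewhat optimistic, and a Student-t critical value gives a wider one. The sketch below compares the two on made-up numbers. The helper name `ci_half_width` and the sample values are assumptions for illustration, and 4.303 is the two-sided 95% t critical value for 2 degrees of freedom.

```python
import statistics


def ci_half_width(values, critical=1.96):
    """Half-width of a two-sided interval around the mean: critical * s / sqrt(n)."""
    if len(values) < 2:
        return 0.0
    return critical * statistics.stdev(values) / (len(values) ** 0.5)


# Illustrative per-document reductions in percent (made-up, not measured results).
reductions = [40.0, 42.0, 44.0]
mean = statistics.mean(reductions)
z_half = ci_half_width(reductions)         # normal approximation, as in ci_95
t_half = ci_half_width(reductions, 4.303)  # t critical value for n - 1 = 2 degrees of freedom
print(f"mean={mean:.2f}  z: +/-{z_half:.2f}  t: +/-{t_half:.2f}")
```

For larger samples the two critical values converge, so the choice mostly matters when the number of valid pairs is small.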
@@ -0,0 +1,30 @@
|
||||
"""
|
||||
Aaron AI ingest_failures helpers — shared by watcher.py and ingest.py.

Both modules write structured failure rows so the SettingsPanel "Ingest Health"
view sees the same shape regardless of ingest path. Functions take an explicit
conn parameter; the caller decides transaction boundaries and exception
handling. Both current callers wrap with their own log-and-swallow shims.
"""


def record_ingest_failure(conn, source: str, filepath, error: str) -> None:
    """Insert or update an ingest_failures row. Commits."""
    cur = conn.cursor()
    cur.execute("""
        INSERT INTO ingest_failures (source, filepath, error, retry_count, first_failed_at, last_failed_at)
        VALUES (%s, %s, %s, 0, NOW(), NOW())
        ON CONFLICT (source) DO UPDATE SET
            error = EXCLUDED.error,
            retry_count = ingest_failures.retry_count + 1,
            last_failed_at = NOW(),
            resolved = FALSE
    """, (source, str(filepath), error[:1000]))
    conn.commit()


def resolve_ingest_failure(conn, source: str) -> None:
    """Mark a previously failed source as resolved. Commits."""
    cur = conn.cursor()
    cur.execute("UPDATE ingest_failures SET resolved = TRUE WHERE source = %s", (source,))
    conn.commit()
|
||||
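The upsert in `record_ingest_failure` can be tried without a PostgreSQL instance. The following is a minimal sketch that uses SQLite (3.24 or newer) as a stand-in. The table schema is an assumption reduced to the columns needed here, the timestamp columns are omitted, and `record_failure` is a hypothetical local copy rather than the helper above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified schema; the real ingest_failures table is not shown in the diff.
conn.execute("""
    CREATE TABLE ingest_failures (
        source TEXT PRIMARY KEY,
        filepath TEXT,
        error TEXT,
        retry_count INTEGER NOT NULL DEFAULT 0,
        resolved INTEGER NOT NULL DEFAULT 0
    )
""")


def record_failure(conn, source, filepath, error):
    """Insert a failure row, or bump retry_count and reopen it if one exists."""
    conn.execute("""
        INSERT INTO ingest_failures (source, filepath, error, retry_count)
        VALUES (?, ?, ?, 0)
        ON CONFLICT (source) DO UPDATE SET
            error = excluded.error,
            retry_count = retry_count + 1,
            resolved = 0
    """, (source, str(filepath), error[:1000]))
    conn.commit()


record_failure(conn, "cv.docx", "/tmp/cv.docx", "first error")
conn.execute("UPDATE ingest_failures SET resolved = 1 WHERE source = ?", ("cv.docx",))
record_failure(conn, "cv.docx", "/tmp/cv.docx", "second error")
row = conn.execute(
    "SELECT error, retry_count, resolved FROM ingest_failures WHERE source = ?",
    ("cv.docx",),
).fetchone()
```

A second failure for the same source overwrites the error text, increments `retry_count`, and clears `resolved`, so a file that was fixed and later breaks again reappears as an open failure.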
@@ -75,6 +75,17 @@ async def lifespan(app: FastAPI):
|
||||
        max_coroutines=2,
    )
    await graphiti_instance.build_indices_and_constraints()
    # Bridge driver._search_ops to driver.search_interface — graphiti-core 0.29.0
    # builds FalkorSearchOperations as driver._search_ops in FalkorDriver.__init__
    # but never assigns it to driver.search_interface. search_utils.py dispatches
    # on driver.search_interface; without this assignment it falls back to
    # interpreted-Cypher cosine math (full table scans). Together with the
    # vendored patches in graphiti_patches/, this activates FalkorDB's native
    # vector index for entity dedup similarity search.
    if (hasattr(graphiti_instance.driver, "_search_ops")
            and graphiti_instance.driver.search_interface is None):
        graphiti_instance.driver.search_interface = graphiti_instance.driver._search_ops
        log.info("Wired driver.search_interface = driver._search_ops (vector index path active)")
    log.info(f"Graphiti ready — provider: {LLM_PROVIDER}, group: {GROUP_ID}")
    yield
    await graphiti_instance.close()
|
||||
|
||||
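The bridge in the hunk above is a guarded attribute assignment. The sketch below shows the same pattern against a stand-in object. `FakeDriver` and `bridge_search_interface` are hypothetical names, and nothing here imports or reproduces graphiti-core.

```python
class FakeDriver:
    """Stand-in for a driver that builds search ops but never exposes them."""

    def __init__(self):
        self._search_ops = object()   # built in __init__ ...
        self.search_interface = None  # ... but never published


def bridge_search_interface(driver) -> bool:
    """Assign driver._search_ops to driver.search_interface if it is still unset."""
    if hasattr(driver, "_search_ops") and getattr(driver, "search_interface", None) is None:
        driver.search_interface = driver._search_ops
        return True
    return False


driver = FakeDriver()
wired_first = bridge_search_interface(driver)
wired_again = bridge_search_interface(driver)
```

Both guards keep the patch inert if a later library release renames `_search_ops` or starts populating `search_interface` itself. The sketch reads `search_interface` through `getattr` with a default, which is slightly more defensive than the direct attribute access in the hunk.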
+132
-132
@@ -1,70 +1,37 @@
|
||||
"""
|
||||
Aaron AI bulk ingester. Two entry points:
|
||||
- ingest_directory(folder, embedder=None) — programmatic; called from
|
||||
api.py /api/reindex with the api process's shared embedder
|
||||
- python3 scripts/ingest.py <folder> — CLI back-compat; loads its own embedder
|
||||
|
||||
Stage 1 helpers (extract / chunk / embed / write) live in scripts/encoding.py.
|
||||
Failure tracking SQL lives in scripts/failures.py.
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import hashlib
|
||||
from pathlib import Path
|
||||
from dotenv import load_dotenv
|
||||
import psycopg2
|
||||
import psycopg2.extras
|
||||
import json
|
||||
from sentence_transformers import SentenceTransformer
|
||||
from docx import Document
|
||||
from pypdf import PdfReader
|
||||
from pptx import Presentation
|
||||
|
||||
from encoding import extract_blocks, chunk_and_embed, write_embeddings_batch, SUPPORTED
|
||||
from failures import (
|
||||
record_ingest_failure as _record_failure_sql,
|
||||
resolve_ingest_failure as _resolve_failure_sql,
|
||||
)
|
||||
|
||||
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
|
||||
|
||||
print("Loading embedding model...")
|
||||
embedder = SentenceTransformer("all-MiniLM-L6-v2")
|
||||
|
||||
PG_DSN = os.getenv("PG_DSN")
|
||||
|
||||
|
||||
def get_pg():
|
||||
return psycopg2.connect(PG_DSN)
|
||||
|
||||
def extract_text_from_docx(path):
|
||||
doc = Document(path)
|
||||
return "\n".join([para.text for para in doc.paragraphs if para.text.strip()])
|
||||
|
||||
def extract_text_from_pdf(path):
|
||||
reader = PdfReader(path)
|
||||
text = ""
|
||||
for page in reader.pages:
|
||||
extracted = page.extract_text()
|
||||
if extracted:
|
||||
text += extracted + "\n"
|
||||
return text
|
||||
|
||||
def extract_text_from_pptx(path):
|
||||
prs = Presentation(path)
|
||||
text = ""
|
||||
for slide in prs.slides:
|
||||
for shape in slide.shapes:
|
||||
if hasattr(shape, "text") and shape.text.strip():
|
||||
text += shape.text + "\n"
|
||||
return text
|
||||
|
||||
def extract_text_from_txt(path):
|
||||
with open(path, "r", encoding="utf-8", errors="ignore") as f:
|
||||
return f.read()
|
||||
|
||||
def chunk_text(text, chunk_size=500, overlap=50):
|
||||
words = text.split()
|
||||
chunks = []
|
||||
start = 0
|
||||
while start < len(words):
|
||||
end = start + chunk_size
|
||||
chunk = " ".join(words[start:end])
|
||||
if chunk.strip():
|
||||
chunks.append(chunk)
|
||||
start += chunk_size - overlap
|
||||
return chunks
|
||||
|
||||
def make_id(filepath, chunk_index):
|
||||
path_hash = hashlib.md5(str(filepath).encode()).hexdigest()[:8]
|
||||
return f"{path_hash}_{chunk_index}"
|
||||
|
||||
def enqueue_stage2(source, full_text):
|
||||
"""Enqueue document for Stage 2 (Mistral orientation) → Stage 3 (Graphiti ingest).
|
||||
"""Enqueue document for Stage 2 (Mistral orientation) -> Stage 3 (Graphiti ingest).
|
||||
TEMPORARY: this queue feed will be removed when pgvector is decommissioned
|
||||
and the watcher calls Stage 2 directly.
|
||||
"""
|
||||
@@ -81,100 +48,133 @@ def enqueue_stage2(source, full_text):
|
||||
completed_at = NULL,
|
||||
failed_at = NULL,
|
||||
attempts = 0
|
||||
""", (source, full_text[:50000], len(full_text)))
|
||||
""", (source, full_text, len(full_text)))
|
||||
pg.commit()
|
||||
pg.close()
|
||||
except Exception as e:
|
||||
print(f" Stage 2 queue insert failed (non-fatal): {e}")
|
||||
|
||||
def ingest_file(filepath):
|
||||
path = Path(filepath)
|
||||
suffix = path.suffix.lower()
|
||||
|
||||
if path.name.startswith("~$") or path.name.startswith("."):
|
||||
return 0
|
||||
|
||||
def _record_failure(filepath: Path, error: str) -> None:
|
||||
try:
|
||||
if suffix == ".docx":
|
||||
text = extract_text_from_docx(path)
|
||||
elif suffix == ".pdf":
|
||||
text = extract_text_from_pdf(path)
|
||||
elif suffix == ".pptx":
|
||||
text = extract_text_from_pptx(path)
|
||||
elif suffix in [".txt", ".md"]:
|
||||
text = extract_text_from_txt(path)
|
||||
else:
|
||||
return 0
|
||||
|
||||
if not text.strip():
|
||||
return 0
|
||||
|
||||
chunks = chunk_text(text)
|
||||
if not chunks:
|
||||
return 0
|
||||
|
||||
embeddings = embedder.encode(chunks).tolist()
|
||||
ids = [make_id(path, i) for i in range(len(chunks))]
|
||||
metadatas = [{
|
||||
"source": path.name,
|
||||
"filepath": str(path),
|
||||
"folder": str(path.parent.relative_to(Path(sys.argv[1]) if len(sys.argv) > 1 else path.parent))
|
||||
} for _ in chunks]
|
||||
|
||||
# STAGE 1: Write to pgvector (TEMPORARY — remove when chat agent migrates to Graphiti)
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
for chunk_id, chunk, embedding, meta in zip(ids, chunks, embeddings, metadatas):
|
||||
cur.execute("""
|
||||
INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
|
||||
VALUES (%s, %s, %s::vector, %s, %s, %s, %s)
|
||||
ON CONFLICT (id) DO UPDATE SET
|
||||
document = EXCLUDED.document,
|
||||
embedding = EXCLUDED.embedding,
|
||||
source = EXCLUDED.source,
|
||||
metadata = EXCLUDED.metadata
|
||||
""", (
|
||||
chunk_id, chunk, embedding,
|
||||
meta.get("source"), "document", None,
|
||||
json.dumps(meta)
|
||||
))
|
||||
pg.commit()
|
||||
pg.close()
|
||||
print(f" Indexed {len(chunks)} chunks: {path.name}")
|
||||
|
||||
# Enqueue for Stage 2 → Stage 3 (Graphiti pipeline)
|
||||
# SKIP_STAGE2_ENQUEUE env var set by migration scripts to prevent bulk enqueue
|
||||
if not os.getenv("SKIP_STAGE2_ENQUEUE"):
|
||||
enqueue_stage2(path.name, text)
|
||||
|
||||
return len(chunks)
|
||||
|
||||
try:
|
||||
_record_failure_sql(pg, filepath.name, filepath, error)
|
||||
finally:
|
||||
pg.close()
|
||||
except Exception as e:
|
||||
print(f" Error: {path.name}: {e}")
|
||||
print(f" Could not record ingest failure (non-fatal): {e}")
|
||||
|
||||
|
||||
def _resolve_failure(source: str) -> None:
|
||||
try:
|
||||
pg = get_pg()
|
||||
try:
|
||||
_resolve_failure_sql(pg, source)
|
||||
finally:
|
||||
pg.close()
|
||||
except Exception as e:
|
||||
print(f" Could not resolve ingest failure record (non-fatal): {e}")
|
||||
|
||||
|
||||
IGNORED_TOP_FOLDERS = {"Drafts"}
|
||||
|
||||
|
||||
def _ingest_one(filepath: Path, embedder, root: Path = None) -> int:
    """Ingest a single file. Returns chunk count, 0 on skip/failure."""
    # "~" catches Office lock files (~$) including the case where Nextcloud
    # filesystem encoding has mangled the "$" to a unicode replacement char.
    if filepath.name.startswith(("~", ".")):
        return 0
    if filepath.suffix.lower() not in SUPPORTED:
        return 0
    if root is not None:
        try:
            rel = filepath.parent.relative_to(root)
            if rel.parts and rel.parts[0] in IGNORED_TOP_FOLDERS:
                return 0
        except ValueError:
            pass
    blocks = extract_blocks(filepath)
    if not blocks or not any(
        (b.get("text") or "").strip() or (b.get("heading") or "").strip()
        for b in blocks
    ):
        _record_failure(filepath, "Text extraction failed or empty")
        return 0
    folder_rel = None
    if root is not None:
        try:
            folder_rel = str(filepath.parent.relative_to(root))
        except ValueError:
            pass
    try:
        rows = chunk_and_embed(blocks, filepath.name, embedder,
                               filepath=filepath, folder=folder_rel)
    except Exception as e:
        _record_failure(filepath, f"Embedding failed: {e}")
        return 0
    if not rows:
        return 0
    try:
        pg = get_pg()
        try:
            write_embeddings_batch(pg, rows)
        finally:
            pg.close()
    except Exception as e:
        _record_failure(filepath, f"pgvector write failed: {e}")
        return 0
    print(f" Indexed {len(rows)} chunks: {filepath.name}")
    _resolve_failure(filepath.name)
    if not os.getenv("SKIP_STAGE2_ENQUEUE"):
        full_text = "\n".join(
            f"{b['heading']}\n{b['text']}" if b.get("heading") else b.get("text", "")
            for b in blocks
        )
        enqueue_stage2(filepath.name, full_text)
    return len(rows)
|
||||
|
||||
|
||||
def ingest_directory(folder, embedder=None) -> dict:
    """Programmatic entry point. Returns {scanned, ingested, failed, total_chunks}.

    If embedder is None, loads its own SentenceTransformer (CLI back-compat path).
    Caller (e.g. api.py /api/reindex) should pass its module-level embedder so
    the ~200MB model isn't reloaded per call.
    """
    folder = Path(folder)
    if not folder.exists():
        return {"scanned": 0, "ingested": 0, "failed": 0, "total_chunks": 0,
                "error": f"folder not found: {folder}"}

    if embedder is None:
        print("Loading embedding model...")
        embedder = SentenceTransformer("all-MiniLM-L6-v2")

    files = [f for f in folder.rglob("*")
             if f.suffix.lower() in SUPPORTED
             and not f.name.startswith(("~$", "."))]
    print(f"Found {len(files)} files to process")

    ingested = failed = total_chunks = 0
    for f in files:
        n = _ingest_one(f, embedder, root=folder)
        if n > 0:
            ingested += 1
            total_chunks += n
        else:
            failed += 1
    return {"scanned": len(files), "ingested": ingested, "failed": failed,
            "total_chunks": total_chunks}
|
||||
|
||||
|
||||
def ingest_folder(folder_path):
|
||||
folder = Path(folder_path)
|
||||
if not folder.exists():
|
||||
print(f"Folder not found: {folder_path}")
|
||||
sys.exit(1)
|
||||
"""CLI back-compat wrapper. Loads its own embedder."""
|
||||
result = ingest_directory(Path(folder_path))
|
||||
print(f"\nDone. {result['ingested']} files / {result['total_chunks']} chunks indexed; "
|
||||
f"{result['failed']} failed.")
|
||||
|
||||
supported = [".docx", ".pdf", ".pptx", ".txt", ".md"]
|
||||
files = [f for f in folder.rglob("*")
|
||||
if f.suffix.lower() in supported
|
||||
and not f.name.startswith("~$")
|
||||
and not f.name.startswith(".")]
|
||||
|
||||
if not files:
|
||||
print("No supported files found.")
|
||||
sys.exit(1)
|
||||
|
||||
print(f"Found {len(files)} files to process\n")
|
||||
total_chunks = 0
|
||||
for f in files:
|
||||
total_chunks += ingest_file(f)
|
||||
|
||||
print(f"\nDone. Total chunks indexed: {total_chunks}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
target = sys.argv[1] if len(sys.argv) > 1 else str(Path.home() / "aaronai" / "docs")
|
||||
|
||||
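The skip rules at the top of `_ingest_one` (lock and hidden files, unsupported suffixes, ignored top-level folders) can be exercised on their own. This is a minimal sketch: `should_skip` is a hypothetical helper, and the `SUPPORTED` set is an assumption taken from the extension list in the removed code, since the real one lives in `encoding.py` and may differ.

```python
from pathlib import PurePosixPath

# Assumption: the real SUPPORTED set is defined in encoding.py and may differ.
SUPPORTED = {".docx", ".pdf", ".pptx", ".txt", ".md"}
IGNORED_TOP_FOLDERS = {"Drafts"}


def should_skip(filepath: PurePosixPath, root: PurePosixPath) -> bool:
    """Return True when a file would be skipped before any text extraction."""
    if filepath.name.startswith(("~", ".")):
        return True
    if filepath.suffix.lower() not in SUPPORTED:
        return True
    try:
        rel = filepath.parent.relative_to(root)
    except ValueError:
        return False  # file is outside root; the folder filter does not apply
    return bool(rel.parts) and rel.parts[0] in IGNORED_TOP_FOLDERS


root = PurePosixPath("/data/files")
checks = {
    "lock": should_skip(root / "Work" / "~$report.docx", root),
    "hidden": should_skip(root / ".DS_Store", root),
    "draft": should_skip(root / "Drafts" / "2024" / "plan.docx", root),
    "nested_drafts": should_skip(root / "Work" / "Drafts" / "plan.docx", root),
    "normal": should_skip(root / "Work" / "plan.PDF", root),
}
```

Only the first path component under the root is compared, so a `Drafts` folder nested deeper in the tree is still ingested.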
@@ -18,8 +18,14 @@ CONVERSATIONS_DB = str(Path.home() / "aaronai" / "conversations.db")
|
||||
PG_DSN = os.getenv("PG_DSN")
|
||||
MIN_EXCHANGES = 3
|
||||
|
||||
print("Loading embedding model...")
|
||||
embedder = SentenceTransformer("all-MiniLM-L6-v2")
|
||||
_embedder = None
|
||||
|
||||
def get_embedder():
|
||||
global _embedder
|
||||
if _embedder is None:
|
||||
print("Loading embedding model...")
|
||||
_embedder = SentenceTransformer("all-MiniLM-L6-v2")
|
||||
return _embedder
|
||||
|
||||
def get_conversations():
|
||||
conn = sqlite3.connect(CONVERSATIONS_DB)
|
||||
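The hunk above replaces an import-time model load with a lazy accessor, so importing the module no longer pays the load cost. A minimal sketch of the same pattern follows, with a cheap stub in place of the real model. `load_model`, `get_model`, and `load_count` are hypothetical names used only to make the single load observable.

```python
_model = None
load_count = 0


def load_model():
    """Hypothetical stand-in for an expensive constructor such as SentenceTransformer(...)."""
    global load_count
    load_count += 1
    return {"name": "stub-model"}


def get_model():
    """Create the model on first use and reuse it afterwards."""
    global _model
    if _model is None:
        _model = load_model()
    return _model


first = get_model()
second = get_model()
```

This form is not thread-safe: two threads calling `get_model()` at the same time could both load. That is usually acceptable for a single-threaded script, but a lock would be needed otherwise.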
@@ -123,9 +129,18 @@ def run():
|
||||
|
||||
# Embed and insert
|
||||
texts = [c[1] for c in new_chunks]
|
||||
embeddings = embedder.encode(texts, show_progress_bar=False).tolist()
|
||||
embeddings = get_embedder().encode(texts, show_progress_bar=False).tolist()
|
||||
|
||||
for (chunk_id, chunk_text, meta), embedding in zip(new_chunks, embeddings):
|
||||
if not meta.get("type"):
|
||||
raise ValueError(
|
||||
f"chunk {chunk_id!r} missing 'type'; writers must supply it "
|
||||
f"(see Improvement #2 in docs/birdai-component-inventory)"
|
||||
)
|
||||
# ON CONFLICT below intentionally overwrites created_at (unlike encoding.py's
|
||||
# COALESCE): an Aaron-AI conversation's created_at tracks convo.updated_at,
|
||||
# which advances on activity. Re-running this script on an active conv
|
||||
# should refresh the timestamp, not preserve the first-seen one.
|
||||
cur.execute("""
|
||||
INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
|
||||
VALUES (%s, %s, %s::vector, %s, %s, %s, %s)
|
||||
|
||||
@@ -0,0 +1,136 @@
|
||||
"""
|
||||
Orientation Indexer — feeds Stage 2's document-level orientations into pgvector
so they're searchable alongside chunk text by the retrieve_documents tool.

Each completed row in stage_3_queue has an `orientation` string (active_frames
+ frame_relationships + extraction_orientation + one_sentence_summary) that
describes the document at a conceptual level. Indexing it as its own row in
the embeddings table gives the cross-encoder a second surface to rank against
— "what is this document about" rather than just "what does this chunk say."

This worker is part of the "read-only Graphiti + orientation-into-pgvector"
plan B that replaced the Stage 3 → Graphiti write path. The graph layer is
queried directly via the search_facts chat tool; orientations land here.

State tracking: a row is considered indexed if the embeddings table already
holds a row with source=<source> and metadata->>'kind'='orientation'. The
worker is idempotent — restart-safe, resumable.

Runs as systemd: aaronai-orientation-indexer.service
"""

import logging
import os
import sys
import time
from pathlib import Path

from dotenv import load_dotenv
import psycopg2
from sentence_transformers import SentenceTransformer

load_dotenv(Path.home() / "aaronai" / ".env", override=True)

sys.path.insert(0, str(Path(__file__).parent))
from encoding import write_embeddings_batch

PG_DSN = os.getenv("PG_DSN")
EMBED_MODEL = "all-MiniLM-L6-v2"
BATCH_SIZE = 25
POLL_INTERVAL_SECS = 30
LOG_FILE = "/var/log/aaronai/orientation-indexer.log"
HEARTBEAT_FILE = "/var/log/aaronai/orientation-indexer-heartbeat"

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [orientation-indexer] %(levelname)s %(message)s",
    handlers=[logging.FileHandler(LOG_FILE, mode="a")],
)
log = logging.getLogger("orientation-indexer")
|
||||
|
||||
|
||||
def get_pg():
    return psycopg2.connect(PG_DSN)


def fetch_unindexed(cur, limit):
    """Pull stage_3_queue rows with a non-null orientation whose orientation
    hasn't been written to the embeddings table yet."""
    cur.execute(
        """
        SELECT s.source, s.orientation
        FROM stage_3_queue s
        WHERE s.orientation IS NOT NULL
          AND NOT EXISTS (
              SELECT 1 FROM embeddings e
              WHERE e.source = s.source
                AND e.metadata->>'kind' = 'orientation'
          )
        ORDER BY s.enqueued_at
        LIMIT %s
        """,
        (limit,),
    )
    return cur.fetchall()


def _row_for(source: str, orientation: str, embedding) -> dict:
    """Build an embeddings row for the orientation. id is deterministic so
    re-runs don't create duplicates if the unique check above ever races."""
    import hashlib
    chunk_id = hashlib.md5(f"orientation:{source}".encode()).hexdigest()[:8] + "_orient"
    return {
        "id": chunk_id,
        "document": orientation,
        "embedding": embedding,
        "source": source,
        "type": "document",
        "metadata": {
            "source": source,
            "kind": "orientation",
        },
    }


def write_heartbeat():
    try:
        Path(HEARTBEAT_FILE).write_text(str(time.time()))
    except Exception:
        pass
|
||||
|
||||
|
||||
def main():
|
||||
log.info("Orientation indexer starting...")
|
||||
log.info(f"Loading embedding model: {EMBED_MODEL}")
|
||||
embedder = SentenceTransformer(EMBED_MODEL)
|
||||
log.info("Embedding model ready.")
|
||||
|
||||
while True:
|
||||
write_heartbeat()
|
||||
try:
|
||||
pg = get_pg()
|
||||
try:
|
||||
cur = pg.cursor()
|
||||
rows = fetch_unindexed(cur, BATCH_SIZE)
|
||||
if not rows:
|
||||
pg.close()
|
||||
time.sleep(POLL_INTERVAL_SECS)
|
||||
continue
|
||||
|
||||
orientations = [r[1] for r in rows]
|
||||
embeddings = embedder.encode(orientations).tolist()
|
||||
batch = [
|
||||
_row_for(source, orient, emb)
|
||||
for (source, orient), emb in zip(rows, embeddings)
|
||||
]
|
||||
write_embeddings_batch(pg, batch)
|
||||
log.info(f"Indexed {len(batch)} orientation(s)")
|
||||
finally:
|
||||
pg.close()
|
||||
except Exception as e:
|
||||
log.error(f"Indexing loop iteration failed: {e}")
|
||||
time.sleep(POLL_INTERVAL_SECS)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
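`_row_for` derives the row id from the source name alone, which is what makes re-runs overwrite rather than duplicate. The sketch below isolates that scheme. `orientation_id` is a hypothetical wrapper name, and the source names are made up.

```python
import hashlib


def orientation_id(source: str) -> str:
    """Same scheme as _row_for: first 8 hex chars of md5('orientation:<source>') + '_orient'."""
    return hashlib.md5(f"orientation:{source}".encode()).hexdigest()[:8] + "_orient"


a = orientation_id("cv.docx")
b = orientation_id("cv.docx")
c = orientation_id("notes.md")
```

Truncating the digest to 8 hex characters leaves 32 bits, so two different sources can in principle map to the same id. If the embeddings table upserts on `id`, as the INSERT statements elsewhere in this diff suggest, such a collision would silently replace one orientation row with another. The chance is small for a personal corpus but grows with the number of sources.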
@@ -0,0 +1,146 @@
|
||||
"""One-off: re-ingest docx+pptx after the 2026-05-04 extractor upgrade (commit 93c0d89).
|
||||
|
||||
Pre-upgrade extraction missed tables, headers/footers, text boxes, group shapes,
|
||||
and pptx notes — leaving CVs/dossiers as section-header skeletons in the index.
|
||||
|
||||
Steps when run with --apply:
|
||||
1. DELETE all embeddings rows where source ends in .docx or .pptx
|
||||
2. Walk NEXTCLOUD_PATH and re-ingest every .docx/.pptx via _ingest_one
|
||||
3. Stage 2 enqueue is suppressed (SKIP_STAGE2_ENQUEUE=1)
|
||||
|
||||
Without --apply: dry-run. Counts files and chunks, prints a sample, writes nothing.
|
||||
"""
|
||||
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
os.environ["SKIP_STAGE2_ENQUEUE"] = "1"
|
||||
|
||||
from dotenv import load_dotenv
|
||||
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
|
||||
|
||||
import psycopg2
|
||||
from sentence_transformers import SentenceTransformer
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
from ingest import _ingest_one, get_pg
|
||||
|
||||
NEXTCLOUD_PATH = Path("/home/aaron/nextcloud/data/data/aaron/files")
|
||||
|
||||
APPLY = "--apply" in sys.argv
|
||||
_ext_args = [a for a in sys.argv[1:] if a.startswith("--ext=")]
|
||||
if _ext_args:
|
||||
TARGET_EXTS = {("." + e.lstrip(".")) for arg in _ext_args
|
||||
for e in arg.split("=", 1)[1].split(",")}
|
||||
else:
|
||||
TARGET_EXTS = {".docx", ".pptx"}
|
||||
|
||||
|
||||
def _ext_regex():
|
||||
inner = "|".join(re.escape(e.lstrip(".")) for e in sorted(TARGET_EXTS))
|
||||
return f"\\.({inner})$"
|
||||
|
||||
|
||||
def count_stale():
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
cur.execute(
|
||||
f"SELECT lower(substring(source from '\\.[^.]+$')) AS ext, "
|
||||
f"COUNT(DISTINCT source) AS files, COUNT(*) AS chunks "
|
||||
f"FROM embeddings WHERE lower(source) ~ '{_ext_regex()}' "
|
||||
f"GROUP BY 1 ORDER BY 1"
|
||||
)
|
||||
rows = cur.fetchall()
|
||||
pg.close()
|
||||
return rows
|
||||
|
||||
|
||||
def delete_stale():
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
cur.execute(f"DELETE FROM embeddings WHERE lower(source) ~ '{_ext_regex()}'")
|
||||
deleted = cur.rowcount
|
||||
pg.commit()
|
||||
pg.close()
|
||||
return deleted
|
||||
|
||||
|
||||
def find_files():
|
||||
files = []
|
||||
for f in NEXTCLOUD_PATH.rglob("*"):
|
||||
if not f.is_file():
|
||||
continue
|
||||
if f.suffix.lower() not in TARGET_EXTS:
|
||||
continue
|
||||
if f.name.startswith(("~$", ".")):
|
||||
continue
|
||||
files.append(f)
|
||||
return files
|
||||
|
||||
|
||||
def main():
|
||||
print(f"Mode: {'APPLY (destructive)' if APPLY else 'DRY-RUN (no writes)'}")
|
||||
print(f"Target: {NEXTCLOUD_PATH}")
|
||||
print(f"Extensions: {sorted(TARGET_EXTS)}")
|
||||
print(f"SKIP_STAGE2_ENQUEUE={os.environ.get('SKIP_STAGE2_ENQUEUE')}")
|
||||
print()
|
||||
|
||||
print("Stale chunks currently in DB:")
|
||||
for ext, files, chunks in count_stale():
|
||||
print(f" {ext}: {files} files, {chunks} chunks")
|
||||
print()
|
||||
|
||||
files = find_files()
|
||||
by_ext = {}
|
||||
for f in files:
|
||||
by_ext.setdefault(f.suffix.lower(), []).append(f)
|
||||
print("Files on disk to re-ingest:")
|
||||
for ext, lst in sorted(by_ext.items()):
|
||||
print(f" {ext}: {len(lst)} files")
|
||||
print(f" total: {len(files)}")
|
||||
print()
|
||||
print("Sample (5 random):")
|
||||
import random
|
||||
for f in random.sample(files, min(5, len(files))):
|
||||
print(f" {f}")
|
||||
print()
|
||||
|
||||
if not APPLY:
|
||||
print("Dry-run only. Re-run with --apply to delete + re-ingest.")
|
||||
return
|
||||
|
||||
print("Deleting stale chunks...")
|
||||
n = delete_stale()
|
||||
print(f" deleted {n} rows")
|
||||
print()
|
||||
|
||||
print("Loading embedder...")
|
||||
embedder = SentenceTransformer("all-MiniLM-L6-v2")
|
||||
print()
|
||||
|
||||
print(f"Re-ingesting {len(files)} files...")
|
||||
started = time.time()
|
||||
ingested = failed = total_chunks = 0
|
||||
for i, f in enumerate(files, 1):
|
||||
n = _ingest_one(f, embedder, root=NEXTCLOUD_PATH)
|
||||
if n > 0:
|
||||
ingested += 1
|
||||
total_chunks += n
|
||||
else:
|
||||
failed += 1
|
||||
if i % 25 == 0 or i == len(files):
|
||||
elapsed = time.time() - started
|
||||
rate = i / elapsed if elapsed else 0
|
||||
print(f" [{i}/{len(files)}] ingested={ingested} failed={failed} "
|
||||
f"chunks={total_chunks} ({rate:.1f} files/s)")
|
||||
elapsed = time.time() - started
|
||||
print()
|
||||
print(f"Done in {elapsed:.0f}s: {ingested} ingested, {failed} failed, "
|
||||
f"{total_chunks} chunks written.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
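Review note on `_ext_regex` above: the script interpolates the generated pattern directly into the SQL text. A minimal sketch of the same pattern builder, plus one possible variant that binds the pattern as a query parameter instead, is shown below. `stale_rows_query` is a hypothetical helper, not part of the commit, and the check uses Python's `re` module only as a stand-in for PostgreSQL's `~` operator, which has its own regex dialect.

```python
import re


def ext_regex(target_exts):
    """Build an end-anchored alternation over the given extensions."""
    inner = "|".join(re.escape(e.lstrip(".")) for e in sorted(target_exts))
    return f"\\.({inner})$"


def stale_rows_query(target_exts):
    """Return (sql, params) with the pattern bound as a parameter.

    Hypothetical variant: psycopg2 would substitute %s at execute time,
    e.g. cur.execute(sql, params).
    """
    sql = "SELECT COUNT(*) FROM embeddings WHERE lower(source) ~ %s"
    return sql, (ext_regex(target_exts),)


pattern = ext_regex({".docx", ".pptx"})
sql, params = stale_rows_query({".docx", ".pptx"})

# Quick local check of which source names the pattern would select.
matches_docx = re.search(pattern, "cv.docx") is not None
matches_backup = re.search(pattern, "cv.docx.bak") is not None
```

Because the pattern is anchored with `$`, a name such as `cv.docx.bak` is not selected, which matches the docstring's "source ends in .docx or .pptx".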
||||
@@ -33,7 +33,7 @@ CHAR_LENGTH_THRESHOLD = 2000
|
||||
REQUEST_TIMEOUT = 300
|
||||
RETRY_ATTEMPTS = 2
|
||||
POLL_INTERVAL = 5
|
||||
WORKER_VERSION = "2.0"
|
||||
WORKER_VERSION = "2.1"
|
||||
|
||||
TAXFREE_PROMPT = (
|
||||
"You are a metadata extraction system. Given a document, describe its content "
|
||||
@@ -67,7 +67,10 @@ def write_heartbeat():
|
||||
|
||||
def recover_wedge():
|
||||
log.warning("Mistral wedge detected — restarting Ollama")
|
||||
subprocess.run(["sudo", "systemctl", "restart", "ollama"], capture_output=True)
|
||||
result = subprocess.run(["/usr/bin/sudo", "/bin/systemctl", "restart", "ollama"], capture_output=True, text=True)
|
||||
if result.returncode != 0:
|
||||
log.error(f"Ollama restart failed (rc={result.returncode}): stdout={result.stdout!r} stderr={result.stderr!r}")
|
||||
return False
|
||||
time.sleep(30)
|
||||
for _ in range(3):
|
||||
try:
|
||||
@@ -146,6 +149,11 @@ def process_one(row):
|
||||
meta = run_mistral(full_text)
|
||||
except requests.exceptions.Timeout:
|
||||
log.warning(f" Mistral timeout on {source}")
|
||||
cur.execute(
|
||||
"UPDATE stage_2_queue SET failed_at = NOW(), failure_reason = %s WHERE id = %s",
|
||||
(f"mistral_timeout_after_{REQUEST_TIMEOUT}s", row_id)
|
||||
)
|
||||
pg.commit()
|
||||
pg.close()
|
||||
return False
|
||||
except Exception as e:
|
||||
@@ -156,6 +164,16 @@ def process_one(row):
|
||||
pg.close()
|
||||
return False
|
||||
|
||||
if meta.get("error") == "parse_failed":
|
||||
log.warning(f" Mistral parse failure on {source}: {meta.get('raw', '')[:100]}")
|
||||
cur.execute(
|
||||
"UPDATE stage_2_queue SET failed_at = NOW(), failure_reason = %s WHERE id = %s",
|
||||
("mistral_parse_failure", row_id)
|
||||
)
|
||||
pg.commit()
|
||||
pg.close()
|
||||
return False
|
||||
|
||||
frames = meta.get("active_frames", [])
|
||||
log.info(f" Frames: {frames}")
|
||||
|
||||
@@ -209,8 +227,9 @@ def run():
|
||||
if consecutive_failures >= 2:
|
||||
log.warning("Multiple consecutive failures — checking for Mistral wedge")
|
||||
recovered = recover_wedge()
|
||||
if recovered:
|
||||
consecutive_failures = 0
|
||||
if not recovered:
|
||||
log.error("Wedge recovery failed — continuing anyway")
|
||||
consecutive_failures = 0
|
||||
time.sleep(10)
|
||||
else:
|
||||
consecutive_failures = 0
|
||||
|
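Review note on the Stage 2 loop above: the change resets the failure counter after a recovery attempt even when the restart fails ("continuing anyway"). The sketch below models that counter in isolation. `FailureTracker` and its attribute names are illustrative assumptions, and it assumes the reset happens regardless of the recovery outcome, which the flattened indentation in this diff does not confirm.

```python
class FailureTracker:
    """Count consecutive failures and trigger a recovery hook at a threshold."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.consecutive = 0
        self.failed_recoveries = 0

    def record(self, success, recover):
        """Update the counter; call recover() once the threshold is reached."""
        if success:
            self.consecutive = 0
            return
        self.consecutive += 1
        if self.consecutive >= self.threshold:
            if not recover():
                # The worker logs an error here and keeps going.
                self.failed_recoveries += 1
            # Assumed: reset whether or not recovery succeeded.
            self.consecutive = 0


restarts = []
tracker = FailureTracker(threshold=2)
for ok in [False, True, False, False, False]:
    # The lambda stands in for recover_wedge(); it records the call and reports success.
    tracker.record(ok, recover=lambda: restarts.append("restart") or True)
```

In this run only the third and fourth outcomes are consecutive failures, so the recovery hook fires once and the final failure starts a new count.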
||||
+117
-22
@@ -9,10 +9,19 @@ write lock contention during entity deduplication. Chunking at ~500 words
|
||||
Each document's chunks are linked via Graphiti's saga mechanism, preserving
|
||||
document structure in the graph.
|
||||
|
||||
Saga-size limit (MAX_CHUNKS_PER_SAGA): the 2026-05-01 incident showed that sagas of
|
||||
17 and 19 chunks deadlock the sidecar's Python-side coordination. Documents
|
||||
producing more than MAX_CHUNKS_PER_SAGA chunks are split into multiple bulk
|
||||
commits, each tagged with the same saga value so Graphiti still links them.
|
||||
|
||||
Wedge detection: the 2026-05-01 incident also surfaced an asymmetry with Stage 2:
|
||||
Stage 3 had no recovery path when the sidecar deadlocked. It now mirrors Stage 2's
|
||||
consecutive_failures pattern, restarting the sidecar once the threshold is reached.
|
||||
|
||||
Runs as systemd service: aaronai-stage3.service
|
||||
"""
|
||||
|
||||
import os, json, time, logging, requests
|
||||
import os, json, time, logging, subprocess, requests
|
||||
from pathlib import Path
|
||||
from datetime import datetime
|
||||
from dotenv import load_dotenv
|
||||
@@ -35,13 +44,16 @@ HEARTBEAT_FILE = Path("/var/log/aaronai/stage3-heartbeat")
|
||||
RETRY_ATTEMPTS = 2
|
||||
POLL_INTERVAL = 5
|
||||
INGEST_TIMEOUT = 600
|
||||
WORKER_VERSION = "2.0"
|
||||
WORKER_VERSION = "2.2"
|
||||
|
||||
# Match Stage 1 chunking parameters
|
||||
CHUNK_SIZE_WORDS = 500
|
||||
CHUNK_OVERLAP_WORDS = 50
|
||||
# Documents shorter than this many characters are ingested as a single episode (no chunking overhead)
|
||||
SINGLE_EPISODE_THRESHOLD = 1500
|
||||
# Sagas larger than this many chunks are split into multiple commits
|
||||
# (2026-05-01 incident: 17 and 19 chunk sagas deadlocked sidecar)
|
||||
MAX_CHUNKS_PER_SAGA = 10
|
||||
|
||||
|
||||
def get_pg():
|
||||
@@ -56,6 +68,33 @@ def write_heartbeat():
|
||||
pass
|
||||
|
||||
|
||||
def recover_wedge():
|
||||
"""Restart Graphiti sidecar when consecutive failures suggest deadlock.
|
||||
Mirrors Stage 2's recover_wedge() for ollama. Requires passwordless sudo
|
||||
for `systemctl restart aaronai-graphiti.service` for the worker's user."""
|
||||
log.warning("Graphiti wedge detected — restarting sidecar")
|
||||
result = subprocess.run(
|
||||
["/usr/bin/sudo", "/bin/systemctl", "restart", "aaronai-graphiti.service"],
|
||||
capture_output=True, text=True
|
||||
)
|
||||
if result.returncode != 0:
|
||||
log.error(f"Sidecar restart failed (rc={result.returncode}): stdout={result.stdout!r} stderr={result.stderr!r}")
|
||||
return False
|
||||
# Sidecar needs longer than ollama for model loading (sentence-transformers
|
||||
# + BGE reranker + Graphiti library init)
|
||||
time.sleep(45)
|
||||
for _ in range(3):
|
||||
try:
|
||||
r = requests.get(f"{GRAPHITI_URL}/health", timeout=10)
|
||||
if r.status_code == 200:
|
||||
log.info("Graphiti recovered")
|
||||
return True
|
||||
except Exception:
|
||||
time.sleep(10)
|
||||
log.error("Graphiti recovery failed")
|
||||
return False
|
||||
|
||||
|
||||
def chunk_text(text, chunk_size=CHUNK_SIZE_WORDS, overlap=CHUNK_OVERLAP_WORDS):
|
||||
"""Split text into word-based chunks matching Stage 1 chunking."""
|
||||
words = text.split()
|
||||
@@ -70,18 +109,33 @@ def chunk_text(text, chunk_size=CHUNK_SIZE_WORDS, overlap=CHUNK_OVERLAP_WORDS):
|
||||
return chunks
|
||||
|
||||
|
||||
def post_bulk(payload, batch_label=""):
|
||||
"""Single POST to /episodes/bulk with consistent error handling."""
|
||||
resp = requests.post(
|
||||
f"{GRAPHITI_URL}/episodes/bulk",
|
||||
json=payload,
|
||||
timeout=INGEST_TIMEOUT
|
||||
)
|
||||
if not resp.ok:
|
||||
prefix = f"{batch_label} " if batch_label else ""
|
||||
raise RuntimeError(f"{prefix}Sidecar {resp.status_code}: {resp.text[:500]}")
|
||||
return resp.json()
|
||||
|
||||
|
||||
def ingest_to_graphiti(source, full_text, orientation):
|
||||
"""
|
||||
Ingest document to Graphiti as chunked episodes linked by saga.
|
||||
|
||||
Short documents (<1500 chars) are ingested as a single episode.
|
||||
Long documents are chunked at 500 words (matching Stage 1) and
|
||||
ingested as a bulk batch with saga=source linking them together.
|
||||
|
||||
Three paths:
|
||||
- Short documents (<SINGLE_EPISODE_THRESHOLD): single episode, no saga
|
||||
- Medium documents (chunks <= MAX_CHUNKS_PER_SAGA): one bulk commit, saga-linked
|
||||
- Large documents (chunks > MAX_CHUNKS_PER_SAGA): split into batches of
|
||||
MAX_CHUNKS_PER_SAGA, each its own bulk commit, all sharing the same saga tag
|
||||
so Graphiti links them as one document unit
|
||||
"""
|
||||
char_length = len(full_text)
|
||||
|
||||
|
||||
if char_length < SINGLE_EPISODE_THRESHOLD:
|
||||
# Single episode — short enough that deduplication won't block
|
||||
episodes = [{
|
||||
"name": source,
|
||||
"content": full_text,
|
||||
@@ -89,27 +143,54 @@ def ingest_to_graphiti(source, full_text, orientation):
|
||||
"timestamp": datetime.now().isoformat(),
|
||||
}]
|
||||
log.info(f" Single episode ({char_length} chars)")
|
||||
payload = {"episodes": episodes, "group_id": "aaron"}
|
||||
else:
|
||||
# Chunk document — each chunk becomes a separate episode
|
||||
chunks = chunk_text(full_text)
|
||||
return post_bulk({"episodes": episodes, "group_id": "aaron"})
|
||||
|
||||
chunks = chunk_text(full_text)
|
||||
total_chunks = len(chunks)
|
||||
|
||||
if total_chunks <= MAX_CHUNKS_PER_SAGA:
|
||||
episodes = [
|
||||
{
|
||||
"name": f"{source} [{i+1}/{len(chunks)}]",
|
||||
"name": f"{source} [{i+1}/{total_chunks}]",
|
||||
"content": chunk,
|
||||
"source_description": orientation,
|
||||
"timestamp": datetime.now().isoformat(),
|
||||
}
|
||||
for i, chunk in enumerate(chunks)
|
||||
]
|
||||
log.info(f" Chunked into {len(chunks)} episodes ({char_length} chars)")
|
||||
# saga=source links all chunks into a document unit in the graph
|
||||
payload = {"episodes": episodes, "group_id": "aaron", "saga": source}
|
||||
log.info(f" Chunked into {total_chunks} episodes ({char_length} chars)")
|
||||
return post_bulk(
|
||||
{"episodes": episodes, "group_id": "aaron", "saga": source}
|
||||
)
|
||||
|
||||
resp = requests.post(f"{GRAPHITI_URL}/episodes/bulk", json=payload, timeout=INGEST_TIMEOUT)
|
||||
if not resp.ok:
|
||||
raise RuntimeError(f"Sidecar {resp.status_code}: {resp.text[:500]}")
|
||||
return resp.json()
|
||||
# Large document: split into batches sharing the same saga tag
|
||||
batch_count = (total_chunks + MAX_CHUNKS_PER_SAGA - 1) // MAX_CHUNKS_PER_SAGA
|
||||
log.info(
|
||||
f" Chunked into {total_chunks} episodes ({char_length} chars); "
|
||||
f"splitting into {batch_count} batches of up to {MAX_CHUNKS_PER_SAGA}"
|
||||
)
|
||||
last_result = None
|
||||
for batch_idx in range(batch_count):
|
||||
start = batch_idx * MAX_CHUNKS_PER_SAGA
|
||||
end = min(start + MAX_CHUNKS_PER_SAGA, total_chunks)
|
||||
batch_chunks = chunks[start:end]
|
||||
episodes = [
|
||||
{
|
||||
"name": f"{source} [{start + i + 1}/{total_chunks}]",
|
||||
"content": chunk,
|
||||
"source_description": orientation,
|
||||
"timestamp": datetime.now().isoformat(),
|
||||
}
|
||||
for i, chunk in enumerate(batch_chunks)
|
||||
]
|
||||
batch_label = f"batch {batch_idx + 1}/{batch_count} (chunks {start + 1}-{end})"
|
||||
log.info(f" {batch_label} starting")
|
||||
last_result = post_bulk(
|
||||
{"episodes": episodes, "group_id": "aaron", "saga": source},
|
||||
batch_label=batch_label,
|
||||
)
|
||||
log.info(f" {batch_label} committed")
|
||||
return last_result
|
||||
|
||||
|
||||
def process_one(row):
|
||||
@@ -145,6 +226,7 @@ def process_one(row):
|
||||
|
||||
def run():
|
||||
log.info(f"Stage 3 worker starting (v{WORKER_VERSION})")
|
||||
consecutive_failures = 0
|
||||
|
||||
while True:
|
||||
write_heartbeat()
|
||||
@@ -166,11 +248,24 @@ def run():
|
||||
pg.close()
|
||||
|
||||
if not row:
|
||||
consecutive_failures = 0
|
||||
time.sleep(POLL_INTERVAL)
|
||||
continue
|
||||
|
||||
process_one(row)
|
||||
time.sleep(2)
|
||||
success = process_one(row)
|
||||
|
||||
if not success:
|
||||
consecutive_failures += 1
|
||||
if consecutive_failures >= 2:
|
||||
log.warning("Multiple consecutive failures — checking for Graphiti wedge")
|
||||
recovered = recover_wedge()
|
||||
if not recovered:
|
||||
log.error("Wedge recovery failed — continuing anyway")
|
||||
consecutive_failures = 0
|
||||
time.sleep(10)
|
||||
else:
|
||||
consecutive_failures = 0
|
||||
time.sleep(2)
|
||||
|
||||
except Exception as e:
|
||||
log.error(f"Worker loop error: {e}")
|
||||
|
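Review note on `ingest_to_graphiti` above: the large-document path uses ceiling division to decide how many bulk commits are needed, and numbers each chunk by its position in the whole document rather than in its batch. The sketch below reproduces just that arithmetic. `plan_batches`, `episode_names`, and the sample file name are hypothetical names introduced for illustration; no request is sent to the sidecar.

```python
def plan_batches(total_chunks, max_per_batch=10):
    """Return (start, end) index pairs that cover range(total_chunks)."""
    # Ceiling division: 23 chunks at 10 per batch gives 3 batches.
    batch_count = (total_chunks + max_per_batch - 1) // max_per_batch
    return [
        (i * max_per_batch, min((i + 1) * max_per_batch, total_chunks))
        for i in range(batch_count)
    ]


def episode_names(source, start, end, total_chunks):
    """Name chunks by their 1-based position in the whole document."""
    return [f"{source} [{i + 1}/{total_chunks}]" for i in range(start, end)]


batches = plan_batches(23)
last_names = episode_names("thesis.pdf", *batches[-1], 23)
```

With 23 chunks the plan is three batches of 10, 10, and 3, and the last batch carries the names ending in `[21/23]` through `[23/23]`, so the numbering stays continuous across commits that share one saga tag.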
||||
@@ -0,0 +1,123 @@
|
||||
"""One-off: remove embeddings rows that no longer correspond to a file on disk.
|
||||
|
||||
Two passes:
|
||||
1. Modern rows (metadata.filepath set): check each filepath, delete if missing.
|
||||
2. Legacy rows (metadata.filepath null): build a set of all basenames present
|
||||
anywhere under NEXTCLOUD_PATH, then delete rows whose `source` basename
|
||||
isn't in that set.
|
||||
|
||||
Default mode is a dry-run (counts + sample paths, no writes). Pass --apply to
|
||||
actually delete.
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from collections import defaultdict
|
||||
|
||||
from dotenv import load_dotenv
|
||||
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
|
||||
|
||||
import psycopg2
|
||||
|
||||
NEXTCLOUD_PATH = Path("/home/aaron/nextcloud/data/data/aaron/files")
|
||||
APPLY = "--apply" in sys.argv
|
||||
|
||||
|
||||
def get_pg():
|
||||
return psycopg2.connect(os.environ["PG_DSN"])
|
||||
|
||||
|
||||
def scan_modern_orphans():
|
||||
"""Rows with metadata.filepath whose file doesn't exist on disk."""
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
cur.execute(
|
||||
"SELECT id, source, metadata->>'filepath' AS filepath "
|
||||
"FROM embeddings WHERE metadata->>'filepath' IS NOT NULL"
|
||||
)
|
||||
orphans = []
|
||||
by_source = defaultdict(int)
|
||||
for row in cur.fetchall():
|
||||
fp = row[2]
|
||||
if fp and not Path(fp).exists():
|
||||
orphans.append(row)
|
||||
by_source[row[1]] += 1
|
||||
pg.close()
|
||||
return orphans, by_source
|
||||
|
||||
|
||||
def scan_legacy_orphans():
|
||||
"""Rows without metadata.filepath whose basename isn't anywhere under
|
||||
NEXTCLOUD_PATH. Restricted to type='document' so conversations and memory
|
||||
snapshots (which are synthetic sources, not files on disk) aren't flagged
|
||||
as orphans. Walks the filesystem once to build the basename set."""
|
||||
print(f" walking {NEXTCLOUD_PATH} to build basename index...")
|
||||
on_disk = set()
|
||||
for p in NEXTCLOUD_PATH.rglob("*"):
|
||||
if p.is_file():
|
||||
on_disk.add(p.name)
|
||||
print(f" {len(on_disk):,} files on disk")
|
||||
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
cur.execute(
|
||||
"SELECT id, source FROM embeddings "
|
||||
"WHERE metadata->>'filepath' IS NULL AND type = 'document'"
|
||||
)
|
||||
orphans = []
|
||||
by_source = defaultdict(int)
|
||||
for row in cur.fetchall():
|
||||
if row[1] not in on_disk:
|
||||
orphans.append(row)
|
||||
by_source[row[1]] += 1
|
||||
pg.close()
|
||||
return orphans, by_source
|
||||
|
||||
|
||||
def delete_rows(ids):
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
cur.execute("DELETE FROM embeddings WHERE id = ANY(%s)", (list(ids),))
|
||||
deleted = cur.rowcount
|
||||
pg.commit()
|
||||
pg.close()
|
||||
return deleted
|
||||
|
||||
|
||||
def main():
|
||||
print(f"Mode: {'APPLY (destructive)' if APPLY else 'DRY-RUN (no writes)'}")
|
||||
print(f"Target: {NEXTCLOUD_PATH}")
|
||||
print()
|
||||
|
||||
print("Pass 1 — modern rows (metadata.filepath set):")
|
||||
modern, modern_by_src = scan_modern_orphans()
|
||||
print(f" {len(modern):,} orphan rows across {len(modern_by_src):,} files")
|
||||
for src, n in sorted(modern_by_src.items(), key=lambda kv: -kv[1])[:10]:
|
||||
print(f" {n:>4} chunks — {src}")
|
||||
print()
|
||||
|
||||
print("Pass 2 — legacy rows (no metadata.filepath):")
|
||||
legacy, legacy_by_src = scan_legacy_orphans()
|
||||
print(f" {len(legacy):,} orphan rows across {len(legacy_by_src):,} files")
|
||||
for src, n in sorted(legacy_by_src.items(), key=lambda kv: -kv[1])[:10]:
|
||||
print(f" {n:>4} chunks — {src}")
|
||||
print()
|
||||
|
||||
total = len(modern) + len(legacy)
|
||||
if total == 0:
|
||||
print("Nothing to delete.")
|
||||
return
|
||||
|
||||
if not APPLY:
|
||||
print(f"Dry-run only. Re-run with --apply to delete {total:,} rows.")
|
||||
return
|
||||
|
||||
print(f"Deleting {total:,} orphan rows...")
|
||||
n1 = delete_rows([r[0] for r in modern]) if modern else 0
|
||||
n2 = delete_rows([r[0] for r in legacy]) if legacy else 0
|
||||
print(f" modern: {n1:,} legacy: {n2:,} total: {n1 + n2:,}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
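Review note on the two passes above: rows that carry a filepath are checked against that exact path, while legacy rows fall back to a basename lookup over the whole tree. The sketch below runs both checks against a temporary directory instead of the database. `find_orphans` and the row tuples are illustrative assumptions; the real script reads rows from `embeddings` and also restricts the legacy pass to `type = 'document'`.

```python
import tempfile
from pathlib import Path


def find_orphans(rows, root):
    """rows are (id, source, filepath_or_None); return ids with no backing file."""
    # Walk the tree once to build the basename index used by the legacy pass.
    on_disk = {p.name for p in Path(root).rglob("*") if p.is_file()}
    orphans = []
    for row_id, source, filepath in rows:
        if filepath is not None:
            missing = not Path(filepath).exists()  # pass 1: exact path check
        else:
            missing = source not in on_disk        # pass 2: basename fallback
        if missing:
            orphans.append(row_id)
    return orphans


with tempfile.TemporaryDirectory() as root:
    kept = Path(root) / "sub" / "kept.docx"
    kept.parent.mkdir()
    kept.write_text("x")
    rows = [
        ("a", "kept.docx", str(kept)),                      # modern row, file present
        ("b", "gone.docx", str(Path(root) / "gone.docx")),  # modern row, file missing
        ("c", "kept.docx", None),                           # legacy row, basename found
        ("d", "legacy_gone.pdf", None),                     # legacy row, basename absent
    ]
    orphan_ids = find_orphans(rows, root)
```

One consequence of the basename fallback is that a legacy row survives as long as any file with the same name exists anywhere under the root, even if the original copy was deleted.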
||||
@@ -0,0 +1,53 @@
|
||||
"""End-to-end test of retrieve_context with intent routing + reranking.
|
||||
|
||||
Avoids loading the full FastAPI app; replicates the chat-handler retrieval
|
||||
call shape and prints classifier output + final ranked sources for each query.
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
from dotenv import load_dotenv
|
||||
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
|
||||
# Stub anthropic so api.py import doesn't fail without the SDK loaded.
|
||||
# We only need retrieve_context.
|
||||
import types
|
||||
sys.modules.setdefault("anthropic", types.ModuleType("anthropic"))
|
||||
sys.modules["anthropic"].Anthropic = lambda **kw: None
|
||||
|
||||
# Same for faster_whisper, unless it is already loaded
|
||||
if "faster_whisper" not in sys.modules:
|
||||
sys.modules["faster_whisper"] = types.ModuleType("faster_whisper")
|
||||
|
||||
import importlib.util
|
||||
spec = importlib.util.spec_from_file_location("api", Path(__file__).parent / "api.py")
|
||||
api = importlib.util.module_from_spec(spec)
|
||||
# Ideally we would exec only the definitions, not the whole module (it starts FastAPI).
|
||||
# Simpler approach used here: exec the whole file and tolerate any side-effect errors.
|
||||
try:
|
||||
spec.loader.exec_module(api)
|
||||
except Exception as e:
|
||||
print(f"(continuing despite api.py side-effect error: {e})")
|
||||
|
||||
retrieve_context = api.retrieve_context
|
||||
|
||||
QUERIES = [
|
||||
"write me a bio",
|
||||
"my professional bio",
|
||||
"Aaron Nelson CV consulting and design work",
|
||||
"FWN3D consulting",
|
||||
"syllabi I have taught",
|
||||
"philosophy of teaching",
|
||||
"Hudson Valley Additive Manufacturing Center",
|
||||
"Aaron Nelson is an artist and educator working in additive manufacturing",
|
||||
]
|
||||
|
||||
for q in QUERIES:
|
||||
pieces, sources = retrieve_context(q)
|
||||
print(f"\n=== {q!r} ===")
|
||||
for i, src in enumerate(sources, 1):
|
||||
print(f" {i}. {src}")
|
||||
+210
-88
@@ -19,7 +19,6 @@ Architecture: Stage 1 (watcher) -> stage_2_queue -> Stage 2 (Mistral) -> stage_3
|
||||
import os
|
||||
import time
|
||||
import json
|
||||
import hashlib
|
||||
import logging
|
||||
import threading
|
||||
from pathlib import Path
|
||||
@@ -30,9 +29,11 @@ from sentence_transformers import SentenceTransformer
|
||||
from watchdog.observers import Observer
|
||||
from watchdog.events import FileSystemEventHandler
|
||||
|
||||
from docx import Document as DocxDocument
|
||||
from pypdf import PdfReader
|
||||
from pptx import Presentation
|
||||
from encoding import extract_blocks, chunk_and_embed, write_embeddings_batch, SUPPORTED
|
||||
from failures import (
|
||||
record_ingest_failure as _record_failure_sql,
|
||||
resolve_ingest_failure as _resolve_failure_sql,
|
||||
)
|
||||
|
||||
load_dotenv(Path.home() / "aaronai" / ".env", override=True)
|
||||
|
||||
@@ -42,10 +43,7 @@ STATE_FILE = "/home/aaron/aaronai/watcher_state.json"
|
||||
STATUS_FILE = "/home/aaron/aaronai/watcher_status.json"
|
||||
HEARTBEAT_FILE = "/home/aaron/aaronai/watcher_heartbeat"
|
||||
|
||||
SUPPORTED = {".pdf", ".docx", ".pptx", ".txt", ".md"}
|
||||
DEBOUNCE_SECONDS = 120
|
||||
CHUNK_SIZE = 500
|
||||
CHUNK_OVERLAP = 50
|
||||
EMBED_MODEL = "all-MiniLM-L6-v2"
|
||||
|
||||
PG_DSN = os.getenv("PG_DSN")
|
||||
@@ -76,48 +74,6 @@ def get_pg():
|
||||
return psycopg2.connect(PG_DSN)
|
||||
|
||||
|
||||
def extract_text(path: Path) -> str:
|
||||
suffix = path.suffix.lower()
|
||||
try:
|
||||
if suffix == ".docx":
|
||||
doc = DocxDocument(path)
|
||||
return "\n".join(p.text for p in doc.paragraphs if p.text.strip())
|
||||
elif suffix == ".pdf":
|
||||
reader = PdfReader(path)
|
||||
return "".join(
|
||||
page.extract_text() + "\n"
|
||||
for page in reader.pages if page.extract_text()
|
||||
)
|
||||
elif suffix == ".pptx":
|
||||
prs = Presentation(path)
|
||||
return "\n".join(
|
||||
shape.text for slide in prs.slides
|
||||
for shape in slide.shapes
|
||||
if hasattr(shape, "text") and shape.text.strip()
|
||||
)
|
||||
elif suffix in {".txt", ".md"}:
|
||||
return path.read_text(encoding="utf-8", errors="ignore")
|
||||
except Exception as e:
|
||||
log.warning(f"Text extraction failed for {path.name}: {e}")
|
||||
return ""
|
||||
|
||||
|
||||
def chunk_text(text: str) -> list:
|
||||
words = text.split()
|
||||
chunks = []
|
||||
start = 0
|
||||
while start < len(words):
|
||||
chunk = " ".join(words[start:start + CHUNK_SIZE])
|
||||
if chunk.strip():
|
||||
chunks.append(chunk)
|
||||
start += CHUNK_SIZE - CHUNK_OVERLAP
|
||||
return chunks
|
||||
|
||||
|
||||
def make_chunk_id(filepath: Path, chunk_index: int) -> str:
|
||||
return hashlib.md5(str(filepath).encode()).hexdigest()[:8] + f"_{chunk_index}"
|
||||
|
||||
|
||||
def enqueue_stage2(source: str, full_text: str):
|
||||
if os.getenv("SKIP_STAGE2_ENQUEUE"):
|
||||
return
|
||||
@@ -134,53 +90,129 @@ def enqueue_stage2(source: str, full_text: str):
|
||||
completed_at = NULL,
|
||||
failed_at = NULL,
|
||||
attempts = 0
|
||||
""", (source, full_text[:50000], len(full_text)))
|
||||
""", (source, full_text, len(full_text)))
|
||||
pg.commit()
|
||||
pg.close()
|
||||
except Exception as e:
|
||||
log.warning(f"Stage 2 enqueue failed (non-fatal): {e}")
|
||||
|
||||
|
||||
def record_ingest_failure(filepath: Path, error: str):
|
||||
"""Write extraction or ingest failure to ingest_failures table for UI visibility.
|
||||
Local wrapper around failures.record_ingest_failure — opens conn, delegates,
|
||||
logs non-fatal errors so the caller never has to handle them."""
|
||||
try:
|
||||
pg = get_pg()
|
||||
try:
|
||||
_record_failure_sql(pg, filepath.name, filepath, error)
|
||||
finally:
|
||||
pg.close()
|
||||
except Exception as e:
|
||||
log.warning(f"Could not record ingest failure (non-fatal): {e}")
|
||||
|
||||
|
||||
def resolve_ingest_failure(source: str):
|
||||
"""Mark a previously failed file as resolved after successful ingest."""
|
||||
try:
|
||||
pg = get_pg()
|
||||
try:
|
||||
_resolve_failure_sql(pg, source)
|
||||
finally:
|
||||
pg.close()
|
||||
except Exception as e:
|
||||
log.warning(f"Could not resolve ingest failure record (non-fatal): {e}")
|
||||
|
||||
|
||||
def delete_embeddings_for_path(filepath: Path):
|
||||
"""Remove embeddings rows for a file that no longer exists. Matches by
|
||||
metadata.filepath so multi-folder same-basename files don't collide.
|
||||
Legacy rows without filepath metadata are left alone — they get cleaned
|
||||
by sweep_orphans.py."""
|
||||
try:
|
||||
pg = get_pg()
|
||||
try:
|
||||
cur = pg.cursor()
|
||||
cur.execute(
|
||||
"DELETE FROM embeddings WHERE metadata->>'filepath' = %s",
|
||||
(str(filepath),),
|
||||
)
|
||||
deleted = cur.rowcount
|
||||
pg.commit()
|
||||
if deleted:
|
||||
log.info(f"Deleted {deleted} chunks for removed file: {filepath}")
|
||||
finally:
|
||||
pg.close()
|
||||
except Exception as e:
|
||||
log.warning(f"Could not delete embeddings for {filepath} (non-fatal): {e}")
|
||||
|
||||
|
||||
def remove_from_state(filepath: Path):
|
||||
"""Drop a deleted file from watcher_state.json so it isn't carried as
|
||||
'known mtime' indefinitely."""
|
||||
try:
|
||||
state = load_state()
|
||||
key = str(filepath)
|
||||
if key in state:
|
||||
del state[key]
|
||||
save_state(state)
|
||||
except Exception as e:
|
||||
log.warning(f"Could not update state for deleted {filepath} (non-fatal): {e}")
|
||||
|
||||
|
||||
IGNORED_TOP_FOLDERS = {"Drafts"}
|
||||
|
||||
|
||||
def ingest_file(filepath: Path, embedder) -> int:
|
||||
if filepath.name.startswith(("~$", ".")):
|
||||
if filepath.name.startswith(("~$", "~", ".")):
|
||||
return 0
|
||||
if filepath.suffix.lower() not in SUPPORTED:
|
||||
return 0
|
||||
text = extract_text(filepath)
|
||||
if not text.strip():
|
||||
return 0
|
||||
chunks = chunk_text(text)
|
||||
if not chunks:
|
||||
return 0
|
||||
try:
|
||||
embeddings = embedder.encode(chunks).tolist()
|
||||
rel = filepath.parent.relative_to(NEXTCLOUD_PATH)
|
||||
if rel.parts and rel.parts[0] in IGNORED_TOP_FOLDERS:
|
||||
return 0
|
||||
except ValueError:
|
||||
pass
|
||||
blocks = extract_blocks(filepath)
|
||||
if not blocks or not any(
|
||||
(b.get("text") or "").strip() or (b.get("heading") or "").strip()
|
||||
for b in blocks
|
||||
):
|
||||
record_ingest_failure(filepath, "Text extraction failed or empty")
|
||||
return 0
|
||||
folder_rel = None
|
||||
try:
|
||||
folder_rel = str(filepath.parent.relative_to(NEXTCLOUD_PATH))
|
||||
except ValueError:
|
||||
pass
|
||||
try:
|
||||
rows = chunk_and_embed(blocks, filepath.name, embedder,
|
||||
filepath=filepath, folder=folder_rel)
|
||||
except Exception as e:
|
||||
log.error(f"Embedding failed for {filepath.name}: {e}")
|
||||
record_ingest_failure(filepath, f"Embedding failed: {e}")
|
||||
return 0
|
||||
if not rows:
|
||||
return 0
|
||||
source = filepath.name
|
||||
try:
|
||||
pg = get_pg()
|
||||
cur = pg.cursor()
|
||||
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
|
||||
chunk_id = make_chunk_id(filepath, i)
|
||||
cur.execute("""
|
||||
INSERT INTO embeddings (id, document, embedding, source, type, created_at, metadata)
|
||||
VALUES (%s, %s, %s::vector, %s, %s, NOW(), %s)
|
||||
ON CONFLICT (id) DO UPDATE SET
|
||||
document = EXCLUDED.document,
|
||||
embedding = EXCLUDED.embedding,
|
||||
source = EXCLUDED.source,
|
||||
metadata = EXCLUDED.metadata
|
||||
""", (chunk_id, chunk, embedding, source, "document",
|
||||
json.dumps({"source": source, "filepath": str(filepath)})))
|
||||
pg.commit()
|
||||
pg.close()
|
||||
pg = get_pg()
|
||||
try:
|
||||
write_embeddings_batch(pg, rows)
|
||||
finally:
|
||||
pg.close()
|
||||
except Exception as e:
|
||||
log.error(f"pgvector write failed for {filepath.name}: {e}")
|
||||
record_ingest_failure(filepath, f"pgvector write failed: {e}")
|
||||
return 0
|
||||
log.info(f"Indexed {len(chunks)} chunks: {filepath.name}")
|
||||
enqueue_stage2(source, text)
|
||||
return len(chunks)
|
||||
log.info(f"Indexed {len(rows)} chunks: {filepath.name}")
|
||||
resolve_ingest_failure(source)
|
||||
full_text = "\n".join(
|
||||
f"{b['heading']}\n{b['text']}" if b.get("heading") else b.get("text", "")
|
||||
for b in blocks
|
||||
)
|
||||
enqueue_stage2(source, full_text)
|
||||
return len(rows)
|
||||
|
||||
|
||||
def ingest_files(paths: list, embedder, state: dict) -> dict:
|
||||
@@ -188,7 +220,8 @@ def ingest_files(paths: list, embedder, state: dict) -> dict:
|
||||
for path in paths:
|
||||
count = ingest_file(path, embedder)
|
||||
total += count
|
||||
state[str(path)] = str(path.stat().st_mtime)
|
||||
if count > 0:
|
||||
state[str(path)] = str(path.stat().st_mtime)
|
||||
log.info(f"Ingestion complete. {total} chunks across {len(paths)} files.")
|
||||
return state
|
||||
|
||||
@@ -216,12 +249,24 @@ def get_changed_files(state: dict) -> list:
|
||||
continue
|
||||
if path.suffix.lower() not in SUPPORTED:
|
||||
continue
|
||||
if path.name.startswith((".", "~$")):
|
||||
if path.name.startswith((".", "~$", "~")):
|
||||
continue
|
||||
if "Admin/Backups" in str(path) or "Backups" in path.parts:
|
||||
continue
|
||||
if "Journal/Media" in str(path):
|
||||
continue
|
||||
if "Generative Design" in path.parts and "Processing" in path.parts:
|
||||
continue
|
||||
if "Computational Design 2017" in path.parts and "Student Work" in path.parts:
|
||||
continue
|
||||
if path.name in ("Renders.pptx", "Ribbon Cutting Slideshow.pptx") \
|
||||
and "Presentations" in path.parts:
|
||||
continue
|
||||
if path.name == "GH Slicer Notes [Autosaved].pptx" \
|
||||
and "DDF555 3D Computational" in path.parts:
|
||||
continue
|
||||
if path.stat().st_size == 0:
|
||||
continue
|
||||
if state.get(str(path)) != str(path.stat().st_mtime):
|
||||
changed.append(path)
|
||||
return changed
|
||||
@@ -299,22 +344,99 @@ class IngestHandler(FileSystemEventHandler):
         self.pending = False
         self.last_event = 0

-    def on_any_event(self, event):
+    def _should_ignore(self, path: Path) -> bool:
+        if path.name.startswith((".", "~$", "~")):
+            return True
+        if "Admin/Backups" in str(path) or "Backups" in path.parts:
+            return True
+        if "Journal/Media" in str(path):
+            return True
+        if "Generative Design" in path.parts and "Processing" in path.parts:
+            return True
+        if "Computational Design 2017" in path.parts and "Student Work" in path.parts:
+            return True
+        if path.name in ("Renders.pptx", "Ribbon Cutting Slideshow.pptx") \
+                and "Presentations" in path.parts:
+            return True
+        if path.name == "GH Slicer Notes [Autosaved].pptx" \
+                and "DDF555 3D Computational" in path.parts:
+            return True
+        return False
+
+    def on_created(self, event):
+        if event.is_directory:
+            return
+        path = Path(event.src_path)
+        if path.suffix.lower() not in SUPPORTED or self._should_ignore(path):
+            return
+        log.info(f"Event: created {path}")
+        self.pending = True
+        self.last_event = time.time()
+
+    def on_modified(self, event):
+        if event.is_directory:
+            return
+        path = Path(event.src_path)
+        if path.suffix.lower() not in SUPPORTED or self._should_ignore(path):
+            return
+        log.info(f"Event: modified {path}")
+        self.pending = True
+        self.last_event = time.time()
+
+    def on_moved(self, event):
+        if event.is_directory:
+            return
+        src = Path(event.src_path)
+        dest = Path(event.dest_path)
+        # If destination is outside NEXTCLOUD_PATH (e.g., Nextcloud trashbin at
+        # /home/aaron/nextcloud/data/data/aaron/files_trashbin/), treat as a
+        # delete — the file is no longer in the watched corpus.
+        try:
+            dest.relative_to(NEXTCLOUD_PATH)
+        except ValueError:
+            if src.suffix.lower() in SUPPORTED:
+                log.info(f"Event: moved out of tree {src} -> {dest}")
+                threading.Thread(
+                    target=lambda: (
+                        delete_embeddings_for_path(src),
+                        remove_from_state(src),
+                    ),
+                    daemon=True,
+                ).start()
+            return
+        # Nextcloud WebDAV writes .part temp files then renames to final path.
+        # src_path is the .part file; dest_path is the final filename.
+        if dest.suffix.lower() not in SUPPORTED or self._should_ignore(dest):
+            return
+        log.info(f"Event: moved -> {dest}")
+        self.pending = True
+        self.last_event = time.time()
+
+    def on_deleted(self, event):
         if event.is_directory:
             return
         path = Path(event.src_path)
         if path.suffix.lower() not in SUPPORTED:
             return
-        if path.name.startswith((".", "~$")):
+        log.info(f"Event: deleted {path}")
+        threading.Thread(
+            target=lambda: (
+                delete_embeddings_for_path(path),
+                remove_from_state(path),
+            ),
+            daemon=True,
+        ).start()
+
+    def on_closed(self, event):
+        # FileClosedEvent fires on the final file after Nextcloud completes write.
+        # Belt-and-suspenders catch for any write pattern not caught by on_moved.
+        if event.is_directory:
             return
-        if "Admin/Backups" in str(path) or "Backups" in path.parts:
+        path = Path(event.src_path)
+        if path.suffix.lower() not in SUPPORTED or self._should_ignore(path):
             return
-        if "Journal/Media" in str(path):
-            return
-        if event.event_type not in ("modified", "created", "moved"):
-            return
-        log.info(f"Event: {event.event_type} {event.src_path}")
-        self.pending = True
+        log.info(f"Event: closed {path}")
+        self.pending = True
         self.last_event = time.time()
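`on_moved` decides whether a move is really a delete by calling `dest.relative_to(NEXTCLOUD_PATH)` and catching `ValueError`. The standalone sketch below isolates that check. `WATCH_ROOT`, `left_watched_tree`, and the example paths are placeholders for illustration, not the project's actual configuration.

```python
from pathlib import PurePosixPath

# Placeholder root; the real code compares against NEXTCLOUD_PATH.
WATCH_ROOT = PurePosixPath("/srv/cloud/files")


def left_watched_tree(dest: PurePosixPath, root: PurePosixPath = WATCH_ROOT) -> bool:
    """Return True when dest is not located under root."""
    try:
        # relative_to raises ValueError when dest is not inside root.
        dest.relative_to(root)
    except ValueError:
        return True
    return False


# The comparison is per path component, so a sibling directory whose name
# merely starts with "files" is still treated as outside the tree.
print(left_watched_tree(PurePosixPath("/srv/cloud/files_trashbin/a.md")))  # True
print(left_watched_tree(PurePosixPath("/srv/cloud/files/Docs/a.md")))      # False
```

On Python 3.9 and later, `PurePath.is_relative_to` expresses the same test without the `try`/`except`.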